Using BeautifulSoup4 in Python to Modify Web Page Content: A Hands-On Walkthrough

I recently had a small project that needed to crawl the resource data referenced by web pages and save it locally, then save the original HTML source file and modify the page content by replacing certain tags wholesale.

For this kind of HTML manipulation, nothing is more convenient than the BeautifulSoup4 library.

The HTML code for the sample is as follows:

<html>
<body>
    <a class="videoslide" href="/wp-content/uploads/1020/" rel="external nofollow">
       <img src="/wp-content/uploads/1020/1381824922_zy_compress.JPG" data-zy-media-/>
    </a>
    <a href="/wp-content/uploads/1020/first_1381824798.JPG" rel="external nofollow">
       <img data-zy-media- src="/wp-content/uploads/1020/first_1381824798_zy_compress.JPG"/>
    </a>
    <a href="/wp-content/uploads/1020/second_1381824796.jpg" rel="external nofollow">
       <img data-zy-media- src="/wp-content/uploads/1020/second_1381824796_zy_compress.jpg"/>
    </a>
    <a href="/wp-content/uploads/1020/third.jpg" rel="external nofollow">
       <img data-zy-media- src="/wp-content/uploads/1020/third_zy_compress.jpg"/>
    </a>
</body>
</html>

The sample consists mainly of <a> tags, each with an <img> tag nested inside it. An <a> tag marked with class="videoslide" is actually a playable animation (a video). So we check for class="videoslide": an <a> tag that has it is replaced wholesale with a <video> player tag, and an <a> tag without class="videoslide" is replaced with a <figure> tag.
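As a small illustration of that decision, here is a sketch that labels each <a> tag as a video or a figure. The helper name is_video_link and the shortened file names are made up for the example, and it uses Python's built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html = (
    '<a class="videoslide" href="/wp-content/uploads/1020/">'
    '<img src="clip_zy_compress.JPG"/></a>'
    '<a href="/wp-content/uploads/1020/third.jpg">'
    '<img src="third_zy_compress.jpg"/></a>'
)
soup = BeautifulSoup(html, 'html.parser')


def is_video_link(a_tag):
    """Hypothetical helper: True when the <a> tag carries class="videoslide"."""
    # class is a multi-valued attribute, so get() returns a list, or None if absent
    return 'videoslide' in (a_tag.get('class') or [])


kinds = ['video' if is_video_link(a) else 'figure' for a in soup.find_all('a')]
print(kinds)  # ['video', 'figure']
```

Testing membership in the class list, rather than comparing against the first entry, also copes with tags that carry more than one class.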

Concretely, each <a class="videoslide" ...><img ... /></a> element is changed to:

<div class="video">
<video controls width="100%" poster="Image address for video link.jpg">
    <source src="Static address of the video file.mp4" type="video/mp4" />
    Your browser does not support H5 video. Please use Chrome/Firefox/Edge.
</video>
</div>

and each remaining <a ...><img .../></a> element is changed to:

<figure>
    <img src="Image address_compressed.jpg" data-zy-media->
    <figcaption>Caption text (if any)</figcaption>
</figure>
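One way to produce a <figure> like this without concatenating HTML strings is to build it from tag objects. The sketch below is only an illustration: build_figure is a hypothetical helper, the file names are shortened, and html.parser is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup


def build_figure(soup, img_src, caption=''):
    """Hypothetical helper: build <figure><img/><figcaption/></figure> as tag objects."""
    figure = soup.new_tag('figure')
    figure.append(soup.new_tag('img', src=img_src))
    figcaption = soup.new_tag('figcaption')
    figcaption.string = caption
    figure.append(figcaption)
    return figure


soup = BeautifulSoup(
    '<a href="third.jpg"><img src="third_zy_compress.jpg"/></a>', 'html.parser')
a_tag = soup.a
# Swap the whole <a> element for the newly built <figure>
a_tag.replace_with(build_figure(soup, a_tag.img['src'], 'A caption'))
print(soup)
```

Because the attribute values never pass through string concatenation, BeautifulSoup takes care of escaping quotes or ampersands that may appear in a path or caption.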

Here we locate the tags with BeautifulSoup4's select() method, read a tag's attribute values with its get() method, and replace a tag with replace_with() (replaceWith is the older alias of the same method).
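Before the full listing, here is a minimal sketch of those three calls on a two-link snippet. The markup and the placeholder text are invented for the illustration, and it uses html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html = ('<p><a class="videoslide" href="/v/"><img src="v.jpg"/></a>'
        '<a href="/p.jpg"><img src="p.jpg"/></a></p>')
soup = BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns a list of matching tags
hrefs = [a_tag.get('href') for a_tag in soup.select('a')]
print(hrefs)

# get() returns None, or the supplied default, when the attribute is missing
print(soup.select_one('a').get('title', 'no title'))

# replace_with() swaps a tag for another element in place; here the
# replacement is the root tag of a separately parsed fragment
fragment = BeautifulSoup('<div class="video">video goes here</div>', 'html.parser')
soup.select_one('a.videoslide').replace_with(fragment.div)
print(soup)
```

Since select() returns an ordinary list, it is safe to replace tags while looping over its result.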

First install the BeautifulSoup4 library. BeautifulSoup4 can work with Python's built-in html.parser, but the code below uses the lxml parser, so install lxml as well.

pip install beautifulsoup4
pip install lxml
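If lxml might be missing on some machine, one option is to fall back to the built-in parser. This is only a sketch: make_soup is a hypothetical helper and is not part of the original script.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in html.parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p>hi</p>')
print(soup.p.string)
```

Keep in mind that the two parsers differ in small ways. For example, lxml wraps a fragment in <html> and <body> tags, while html.parser does not.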

The specific code implementation is as follows:

import os
from bs4 import BeautifulSoup

# Sample input. The data-zy-media- attributes appear truncated in this listing;
# the function below reads the full attribute name, data-zy-media-id.
htmlstr = '<html><body>' \
        '<a class="videoslide" href="/wp-content/uploads/1020/" rel="external nofollow">' \
        '<img src="/wp-content/uploads/1020/1381824922_zy_compress.JPG" data-zy-media-/></a>' \
        '<a href="/wp-content/uploads/1020/first_1381824798.JPG" rel="external nofollow">' \
        '<img data-zy-media- src="/wp-content/uploads/1020/first_1381824798_zy_compress.JPG"/></a>' \
        '<a href="/wp-content/uploads/1020/second_1381824796.jpg" rel="external nofollow">' \
        '<img data-zy-media- src="/wp-content/uploads/1020/second_1381824796_zy_compress.jpg"/></a>' \
        '<a href="/wp-content/uploads/1020/third.jpg" rel="external nofollow">' \
        '<img data-zy-media- src="/wp-content/uploads/1020/third_zy_compress.jpg"/></a>' \
        '</body></html>'


def procHtml(htmlstr):
    soup = BeautifulSoup(htmlstr, 'lxml')
    # Find every <a> tag in the document
    a_tags = soup.select('a')
    for a_tag in a_tags:
        # Point the link at the local copy: src/<file name>
        a_tag_src = a_tag.get('href') or ''
        a_tag_filename = os.path.basename(a_tag_src)
        a_tag_path = os.path.join('src', a_tag_filename)
        a_tag['href'] = a_tag_path
        # The <img> nested inside this <a> tag
        next_tag = a_tag.find('img')
        if next_tag is None:
            continue
        # An <a> tag with class="videoslide" is a video; otherwise it is an image
        if 'videoslide' in (a_tag.get('class') or []):
            # Process the video file
            media_id = next_tag.get('data-zy-media-id')
            if media_id:
                media_url = '/travel/show_media/' + str(media_id) + '.mp4'
                media_filename = os.path.basename(media_url)
                media_path = os.path.join('src', media_filename)
                # Replace the <a> tag with a <div class="video"> wrapping a <video> tag
                video_html = ('<div class="video"><video controls width="100%" poster="{}">'
                              '<source src="{}" type="video/mp4" /> '
                              'Your browser does not support H5 video. Please use Chrome / Firefox / Edge. '
                              '</video></div>').format(a_tag_path, media_path)
                video_soup = BeautifulSoup(video_html, 'lxml')
                a_tag.replace_with(video_soup.div)
        else:
            # Get the image information
            img_src = next_tag.get('src')
            # Skip images that are already local resources (data:image or file: URLs)
            if img_src and 'data:image' not in img_src and 'file:' not in img_src:
                img_filename = os.path.basename(img_src)
                img_path = os.path.join('src', img_filename)
                # Replace the <a> tag with a <figure> wrapping the <img> tag
                figcaption = ''
                figure_html = ('<figure><img src="{}" data-zy-media-id="{}">'
                               '<figcaption>{}</figcaption></figure>').format(img_path, a_tag_path, figcaption)
                figure_soup = BeautifulSoup(figure_html, 'lxml')
                a_tag.replace_with(figure_soup.figure)
    # The root <html> element of the modified document
    html_content = soup.contents[0]
    return html_content


if __name__ == '__main__':
    pro_html_str = procHtml(htmlstr)
    print(pro_html_str)
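One detail worth checking when adapting this to real pages is how the <img> inside each <a> is reached. In the one-line htmlstr the image is the very first node inside the link, but in indented markup such as the sample HTML shown earlier, a whitespace text node comes first. The sketch below shows the difference on an invented snippet, parsed with html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup, NavigableString

html = '<a href="/x.jpg">\n   <img src="/x_zy_compress.jpg"/>\n</a>'
soup = BeautifulSoup(html, 'html.parser')
a_tag = soup.a

# The first node after the opening <a> is the whitespace text, not the <img>
first_node = a_tag.next_element
print(type(first_node).__name__, repr(str(first_node)))

# find() searches the tag's descendants, so the whitespace does not matter
img_tag = a_tag.find('img')
print(img_tag['src'])
```

Looking the image up by name also makes it easy to skip links that contain no image at all, since find() then returns None.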

The result is as follows (the backslashes in the paths suggest the script was run on Windows):

<html>
<body>
<div class="video">
<video controls="" poster="src\" width="100%">
<source src="src\zy_location_201310151613422786.mp4" type="video/mp4"/> Your browser does not support H5 video. Please use Chrome / Firefox / Edge.
</video>
</div>
<figure>
<img data-zy-media- src="src\first_1381824798_zy_compress.JPG"/>
<figcaption></figcaption>
</figure>
<figure>
<img data-zy-media- src="src\second_1381824796_zy_compress.jpg"/>
<figcaption></figcaption></figure>
<figure>
<img data-zy-media- src="src\third_zy_compress.jpg"/>
<figcaption></figcaption>
</figure>
</body>
</html>
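The paths in this result use backslashes, which is what Windows path joining produces. For values that end up in HTML attributes, forward slashes are usually what you want. The sketch below compares the two standard-library flavours explicitly; it assumes the paths were built by joining 'src' with the file name, which the listing does not state outright.

```python
import ntpath      # Windows path rules, importable on any platform
import posixpath   # POSIX path rules, which match URL-style paths

url = '/wp-content/uploads/1020/third_zy_compress.jpg'
filename = posixpath.basename(url)

windows_style = ntpath.join('src', filename)
url_style = posixpath.join('src', filename)
print(windows_style)  # src\third_zy_compress.jpg
print(url_style)      # src/third_zy_compress.jpg

# A URL ending in "/" has an empty base name, so only the directory is left
empty_name = posixpath.basename('/wp-content/uploads/1020/')
print(repr(ntpath.join('src', empty_name)))  # 'src\\'
```

The last line would also explain the poster="src\" value in the result: the video link's href ends with a slash, so there is no file name to append.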

Summary

That concludes this walkthrough of using BeautifulSoup4 in Python to modify the content of a web page: find the tags with select(), read their attributes with get(), and swap in the new markup with replace_with().