For a small recent project I needed to crawl resource data from web pages and save it locally, keep the original HTML source files, and then rewrite the page content by replacing certain tags wholesale.
For this kind of HTML manipulation, few tools are more convenient than the BeautifulSoup4 library.
The HTML code for the sample is as follows:
<html> <body> <a class="videoslide" href="/wp-content/uploads/1020/" rel="external nofollow" rel="external nofollow" > <img src="/wp-content/uploads/1020/1381824922_zy_compress.JPG" data-zy-media-/> </a> <a href="/wp-content/uploads/1020/first_1381824798.JPG" rel="external nofollow" rel="external nofollow" > <img data-zy-media- src="/wp-content/uploads/1020/first_1381824798_zy_compress.JPG"/></a> <a href="/wp-content/uploads/1020/second_1381824796.jpg" rel="external nofollow" rel="external nofollow" > <img data-zy-media- src="/wp-content/uploads/1020/second_1381824796_zy_compress.jpg"/> </a> <a href="/wp-content/uploads/1020/third.jpg" rel="external nofollow" rel="external nofollow" > <img data-zy-media- src="/wp-content/uploads/1020/third_zy_compress.jpg"/> </a> </body> </html>
The sample consists mainly of <a> tags, each with an <img> tag nested inside. An <a> tag marked with class="videoslide" identifies an item that is actually a playable video. Based on class="videoslide", we decide whether to replace the entire <a> tag with a <video> player tag; <a> tags without class="videoslide" are replaced with <figure> tags.
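The classification rule described above can be sketched in a few lines. This is a minimal illustration, not the article's own code: the helper name is_video_link and the shortened sample paths are made up for the example, and it uses the built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample: one video link and one image link.
html = (
    '<a class="videoslide" href="/v/"><img src="/v/poster.jpg"/></a>'
    '<a href="/i/one.jpg"><img src="/i/one_small.jpg"/></a>'
)
soup = BeautifulSoup(html, "html.parser")


def is_video_link(a_tag):
    """Return True when the <a> tag carries class="videoslide"."""
    # class is a multi-valued attribute, so get() returns a list
    # such as ['videoslide'], or None when the attribute is absent.
    return "videoslide" in (a_tag.get("class") or [])


kinds = ["video" if is_video_link(a) else "image" for a in soup.select("a")]
print(kinds)  # ['video', 'image']
```

Checking membership in the class list, rather than comparing only the first class, keeps the test working when a tag has several classes.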
In other words, a tag of the form <a class="videoslide" ...><img ... /></a>
is changed to
<div class="video"> <video controls width="100%" poster="Image address for video link.jpg"> <source src="Static address of the video file.mp4" type="video/mp4" /> Your browser does not support H5 video. Please use Chrome, Firefox, or Edge. </video> </div>
and a tag of the form <a ....><img .../></a>
is changed to
<figure> <img src="Image address_compressed.jpg" data-zy-media-> <figcaption>Caption text (if any)</figcaption> </figure>
Here we locate the tags with BeautifulSoup4's select() method, read tag attribute values with the get() method, and swap tags with replaceWith(), the older camelCase name for replace_with().
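To make the three calls concrete, here is a small sketch on a one-line document. The markup and variable names are invented for illustration, and html.parser is used so the snippet does not need lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a href="/img/cat.jpg">cat</a></p>', "html.parser")

# select() takes a CSS selector and returns a list of matching tags.
link = soup.select("a")[0]

# get() reads an attribute and returns None when it is missing.
href = link.get("href")
missing = link.get("title")

# replace_with() swaps the tag for another element in the tree.
new_tag = soup.new_tag("span")
new_tag.string = href
link.replace_with(new_tag)

result = str(soup)
print(result)  # <p><span>/img/cat.jpg</span></p>
```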
First install the BeautifulSoup4 library. The code below uses lxml as its parser, so the lxml library needs to be installed as well.
pip install beautifulsoup4
pip install lxml
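If lxml might not be available in every environment, one option is to fall back to Python's built-in parser. The helper below is a hypothetical sketch (make_soup is not part of BeautifulSoup4); it relies on the FeatureNotFound exception that BeautifulSoup4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when installed, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser shipped with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hi</p>")
print(soup.p.string)  # hi
```

Note that the two parsers can build slightly different trees: lxml wraps a fragment in <html> and <body>, while html.parser does not.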
The specific code implementation is as follows:
import os
from bs4 import BeautifulSoup

# Sample input. The data-zy-media- attributes are truncated in this copy of
# the sample; the code below looks for data-zy-media-id.
htmlstr = (
    '<html><body>'
    '<a class="videoslide" href="/wp-content/uploads/1020/" rel="external nofollow" rel="external nofollow" >'
    '<img src="/wp-content/uploads/1020/1381824922_zy_compress.JPG" data-zy-media-/></a>'
    '<a href="/wp-content/uploads/1020/first_1381824798.JPG" rel="external nofollow" rel="external nofollow" >'
    '<img data-zy-media- src="/wp-content/uploads/1020/first_1381824798_zy_compress.JPG"/></a>'
    '<a href="/wp-content/uploads/1020/second_1381824796.jpg" rel="external nofollow" rel="external nofollow" >'
    '<img data-zy-media- src="/wp-content/uploads/1020/second_1381824796_zy_compress.jpg"/></a>'
    '<a href="/wp-content/uploads/1020/third.jpg" rel="external nofollow" rel="external nofollow" >'
    '<img data-zy-media- src="/wp-content/uploads/1020/third_zy_compress.jpg"/></a>'
    '</body></html>'
)


def procHtml(htmlstr):
    soup = BeautifulSoup(htmlstr, 'lxml')
    for a_tag in soup.select('a'):
        # Point the link at the local copy under src/
        a_tag_src = a_tag.get('href', '')
        a_tag_filename = os.path.basename(a_tag_src)
        a_tag_path = os.path.join('src', a_tag_filename)
        a_tag['href'] = a_tag_path
        # The <img> nested inside the <a>
        next_tag = a_tag.find('img')
        if next_tag is None:
            continue
        # class="videoslide" marks a video; anything else is an image
        if 'videoslide' in (a_tag.get('class') or []):
            # Process the video file; skipped when the media id is missing
            media_id = next_tag.get('data-zy-media-id')
            if media_id:
                media_url = '/travel/show_media/' + str(media_id) + '.mp4'
                media_filename = os.path.basename(media_url)
                media_path = os.path.join('src', media_filename)
                # Replace the <a> tag with a <video> block
                video_html = (
                    f'<div class="video"><video controls width="100%" poster="{a_tag_path}">'
                    f'<source src="{media_path}" type="video/mp4" />'
                    'Your browser does not support H5 video. Please use Chrome, Firefox, or Edge.'
                    '</video></div>'
                )
                video_soup = BeautifulSoup(video_html, 'lxml')
                a_tag.replace_with(video_soup.div)
        else:
            # Get the image information
            img_src = next_tag.get('src', '')
            # Skip inline data:image URIs and file: paths
            if 'data:image' not in img_src and 'file:' not in img_src:
                img_filename = os.path.basename(img_src)
                img_path = os.path.join('src', img_filename)
                # Replace the <a> tag with a <figure><img> block
                figcaption = ''
                figure_html = (
                    f'<figure><img src="{img_path}" data-zy-media-id="{a_tag_path}">'
                    f'<figcaption>{figcaption}</figcaption></figure>'
                )
                figure_soup = BeautifulSoup(figure_html, 'lxml')
                a_tag.replace_with(figure_soup.figure)
    # With the lxml parser the first top-level node is the <html> tag
    return soup.contents[0]


if __name__ == '__main__':
    pro_html_str = procHtml(htmlstr)
    print(pro_html_str)
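As an alternative to assembling the replacement markup by string concatenation, the new elements can be built with soup.new_tag(), which leaves attribute escaping to BeautifulSoup4. The following is only a sketch of that variation: the helper to_figure, its caption parameter, and the shortened sample paths are assumptions for illustration, not part of the code above.

```python
from bs4 import BeautifulSoup


def to_figure(soup, a_tag, caption=""):
    """Replace <a><img/></a> with <figure><img/><figcaption/></figure>."""
    img = a_tag.find("img")
    if img is None:
        return None  # nothing to convert
    figure = soup.new_tag("figure")
    # Keyword arguments to new_tag() become attributes on the new tag.
    figure.append(soup.new_tag("img", src=img.get("src", "")))
    figcaption = soup.new_tag("figcaption")
    figcaption.string = caption
    figure.append(figcaption)
    a_tag.replace_with(figure)
    return figure


soup = BeautifulSoup(
    '<a href="/i/one.jpg"><img src="/i/one_small.jpg"/></a>', "html.parser"
)
to_figure(soup, soup.a, caption="One")
print(soup)
```

Building tags this way also avoids parsing a second document for every replacement.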
Result:
<html> <body> <div class="video"> <video controls="" poster="src\" width="100%"> <source src="src\zy_location_201310151613422786.mp4" type="video/mp4"/> Your browser does not support H5 video. Please use Chrome, Firefox, or Edge. </video> </div> <figure> <img data-zy-media- src="src\first_1381824798_zy_compress.JPG"/> <figcaption></figcaption> </figure> <figure> <img data-zy-media- src="src\second_1381824796_zy_compress.jpg"/> <figcaption></figcaption></figure> <figure> <img data-zy-media- src="src\third_zy_compress.jpg"/> <figcaption></figcaption> </figure> </body> </html>
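The backslashes in paths such as src\third_zy_compress.jpg suggest the script was run on Windows, assuming the paths were built with os.path.join, which uses the platform's separator. Browsers expect forward slashes in URLs, so one possible adjustment is to join with posixpath instead. The sketch below uses ntpath to reproduce the Windows behavior on any platform; the variable names are illustrative.

```python
import ntpath
import posixpath

filename = posixpath.basename("/wp-content/uploads/1020/third_zy_compress.jpg")

# ntpath is the Windows flavor of os.path; posixpath is the POSIX flavor.
windows_style = ntpath.join("src", filename)
url_style = posixpath.join("src", filename)

print(windows_style)  # src\third_zy_compress.jpg
print(url_style)      # src/third_zy_compress.jpg

# A URL ending in "/" has an empty basename, which would explain poster="src\".
poster = ntpath.join("src", posixpath.basename("/wp-content/uploads/1020/"))
print(poster)  # src\
```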
Summary
That covers using Python and BeautifulSoup4 to modify the content of a web page.