I. Introduction to the module
Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your preferred parser to give you idiomatic ways to navigate, search, and modify the document tree, which can save hours or even days of work.
II. Usage
1. Importing the module
# Import the module and build a soup object from a sample document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="/elsie" rel="external nofollow" class="sister" >Elsie</a>,
<a href="/lacie" rel="external nofollow" class="sister" >Lacie</a> and
<a href="/tillie" rel="external nofollow" class="sister" >Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
# The second argument names the parser; 'html.parser' is the one built into Python
soup = BeautifulSoup(html_doc, 'html.parser')
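Beautiful Soup supports four parsers: Python's built-in html.parser, lxml's HTML parser, lxml's XML parser, and html5lib. All except html.parser must be installed separately.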
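The parser is selected by the second argument to BeautifulSoup. The following is a minimal sketch that assumes only the built-in html.parser is available; the fragment and variable names are made up for illustration. Passing "lxml", "xml", or "html5lib" instead selects the other parsers, provided the corresponding package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for this illustration.
fragment = "<p class='note'>Hello, <b>world</b></p>"

# "html.parser" ships with Python, so nothing extra has to be installed.
soup = BeautifulSoup(fragment, "html.parser")

first_p = soup.p
print(first_p.name)       # the tag name
print(first_p["class"])   # class is multi-valued, so a list comes back
print(first_p.b.string)   # the text inside the nested <b> tag
```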
2. A few simple ways to browse structured data
# Get a Tag; in plain terms, a Tag is a single HTML tag
soup.title                     # Get the whole title tag: <title>The Dormouse's story</title>
soup.title.name                # Get the name of the title tag: title
soup.title.parent.name         # Get the name of the title tag's parent: head
soup.p                         # Get the first p tag: <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']                # Get the value of the class attribute of the first p: ['title']
soup.p.get('class')            # Equivalent to the above
soup.a                         # Get the first a tag
soup.find_all('a')             # Get all a tags
soup.find(id="link3")          # Get the tag whose id attribute is link3 (None if no tag has that id)
soup.a['class'] = "newClass"   # Attributes, contents, and so on can be modified
del soup.a['class']            # An attribute can also be deleted
soup.find('a').get('id')       # Get the value of the id attribute of the first a tag (None if it has no id)
soup.title.string              # Get the text of the title tag: The Dormouse's story
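The navigation calls above can be tried end to end with a small script. This is a sketch under one stated assumption: the sample document is shortened, and id attributes (link1 to link3) are added to the a tags so that the id lookup has something to find.

```python
from bs4 import BeautifulSoup

# Shortened sample document. The id attributes are an assumption added
# for this illustration so that find(id=...) returns a tag.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="/elsie" class="sister" id="link1">Elsie</a>,
<a href="/lacie" class="sister" id="link2">Lacie</a> and
<a href="/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string         # text of the <title> tag
parent_name = soup.title.parent.name   # name of the tag enclosing <title>
first_p_class = soup.p["class"]        # class values of the first <p>
link_count = len(soup.find_all("a"))   # number of <a> tags
third_link = soup.find(id="link3")     # the tag with id="link3"

# Modify an attribute on the first <a> tag, then delete it.
soup.a["class"] = "newClass"
changed = soup.a["class"]
del soup.a["class"]
remaining = soup.a.attrs               # attributes left on the first <a>

print(title_text, parent_name, first_p_class, link_count)
print(third_link.get_text(), changed, remaining)
```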
III. Specific usage
1. Get tags with a specified attribute
# Method 1: match a single attribute
soup.find_all('div', id="even")                                  # Get all div tags with id="even"
soup.find_all('div', attrs={'id': "even"})                       # Same effect as above
# Method 2: match several attributes
soup.find_all('div', id="even", class_="square")                 # Get all div tags with id="even" and class="square"
soup.find_all('div', attrs={"id": "even", "class": "square"})    # Same effect as above
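To see that the keyword form and the attrs form select the same tags, here is a minimal sketch on made-up markup. The div contents and the repeated id values are assumptions for the demonstration only (repeated ids are not valid HTML, but they parse).

```python
from bs4 import BeautifulSoup

# Hypothetical markup for this illustration.
html = """
<div id="even" class="square">A</div>
<div id="even" class="round">B</div>
<div id="odd" class="square">C</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Single attribute: keyword argument versus attrs dictionary.
by_kwarg = [d.get_text() for d in soup.find_all("div", id="even")]
by_attrs = [d.get_text() for d in soup.find_all("div", attrs={"id": "even"})]

# Two attributes. The keyword is class_ because class is reserved in Python.
both_kwargs = [d.get_text() for d in soup.find_all("div", id="even", class_="square")]
both_attrs = [
    d.get_text()
    for d in soup.find_all("div", attrs={"id": "even", "class": "square"})
]

print(by_kwarg, by_attrs)
print(both_kwargs, both_attrs)
```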
2. Get the attribute value of a tag
# Method 1: extract by subscript
for link in soup.find_all('a'):
    print(link['href'])        # equivalent to print(link.get('href'))
# Method 2: extract through the attrs dictionary
for link in soup.find_all('a'):
    print(link.attrs['href'])
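The subscript form and get() behave differently when the attribute is missing: the subscript raises KeyError, while get() returns None. A short sketch on made-up markup (the two a tags are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> tag has no href attribute.
html = '<a href="/elsie">Elsie</a><a>No link</a>'
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# get() is safe when an attribute may be absent.
hrefs = [link.get("href") for link in links]

# Subscripting a missing attribute raises KeyError.
try:
    links[1]["href"]
    missing_raises = False
except KeyError:
    missing_raises = True

# attrs exposes all attributes of a tag as a dictionary.
attrs_first = links[0].attrs

print(hrefs, missing_raises, attrs_first)
```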
3. Get the content of a tag
divs = soup.find_all('div')       # Get all div tags
for div in divs:                  # Loop over each div
    a = div.find_all('a')[0]      # Find the first a tag in the div
    print(a.string)               # Output the contents of the a tag
If the result is not displayed correctly (for example, .string is None because the tag has more than one child), the text can be converted to a list, such as list(a.strings).
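The loop above can fail in two ways: a div with no a tag makes the [0] index raise IndexError, and .string is None when the a tag has more than one child. The sketch below shows one way to handle both, on made-up markup; using find() and get_text() here is a suggestion, not the only approach.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one simple link, one link with nested markup,
# and one div without any link.
html = """
<div><a href="/one">One</a></div>
<div><a href="/two">Two <b>bold</b></a></div>
<div>No link here</div>
"""
soup = BeautifulSoup(html, "html.parser")

results = []
for div in soup.find_all("div"):
    a = div.find("a")   # find() returns None when there is no match
    if a is None:
        continue
    # .string is None when the tag has several children;
    # get_text() and .strings still return all of the text.
    results.append((a.string, a.get_text(), list(a.strings)))

for row in results:
    print(row)
```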
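4. stripped_strings
stripped_strings yields the text fragments inside a tag with leading and trailing whitespace removed, and skips fragments that consist only of whitespace such as \n line breaks.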
divs = soup.find_all('div')
for div in divs:
    infos = list(div.stripped_strings)    # Remove spaces, line breaks, etc.
    print(infos)
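Comparing .strings with .stripped_strings on the same tag makes the difference visible. A minimal sketch, with made-up markup and names:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with indentation and line breaks around the text.
html = """
<div>
  <p> Alice </p>
  <p>
    Bob
  </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.div

raw = list(div.strings)              # includes whitespace-only fragments
clean = list(div.stripped_strings)   # trimmed, whitespace-only fragments skipped

print(raw)
print(clean)
```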
IV. Output
1. Formatted output: prettify()
The prettify() method formats the Beautiful Soup document tree and returns it as a Unicode string, with each XML/HTML tag on its own line.
markup = '<a href="/" rel="external nofollow">I linked to <i></i></a>'
soup = BeautifulSoup(markup)
# The <html>, <head>, and <body> wrappers below are added by the parser;
# html.parser would leave the fragment unwrapped.
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="/" rel="external nofollow">\n...'
print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="/" rel="external nofollow">
#    I linked to
#    <i>
#    </i>
#   </a>
#  </body>
# </html>
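For comparison, here is a small sketch that sets str(soup), which keeps the markup compact, next to prettify(), which puts every tag and string on its own line. The fragment is made up for illustration, and html.parser is assumed, so no html or body wrapper is added.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for this illustration.
markup = "<p>Hi <b>there</b></p>"
soup = BeautifulSoup(markup, "html.parser")

compact = str(soup)        # markup without added line breaks
pretty = soup.prettify()   # one tag or string per line, indented by depth

# Collect the non-empty lines without their indentation for inspection.
pretty_lines = [line.strip() for line in pretty.splitlines() if line.strip()]

print(compact)
print(pretty)
```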
2. get_text()
If you only want to get the text contained in the tag, then you can call the get_text() method, which gets all the text contained in the tag, including the content of the descendant tags, and returns the result as a Unicode string.
markup = '<a href="/" rel="external nofollow">\nI linked to <i></i>\n</a>'
soup = BeautifulSoup(markup)
soup.get_text()
# '\nI linked to \n'
soup.i.get_text()
# ''
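get_text() also accepts a separator string and a strip flag, which help when the text comes from several nested tags. A minimal sketch, assuming made-up text inside the i tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the text inside <i> is invented for this example.
markup = '<a href="/">\nI linked to <i>a page</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

plain = soup.get_text()                      # all text, whitespace kept
joined = soup.get_text("|")                  # fragments joined with "|"
stripped = soup.get_text("|", strip=True)    # fragments trimmed, empty ones dropped
inner = soup.i.get_text()                    # text of the <i> tag only

print(repr(plain))
print(repr(joined))
print(repr(stripped))
print(repr(inner))
```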