I. Installation
- Beautiful Soup is a third-party library, so it must be installed separately; installation is straightforward
- Because BS4 delegates the actual parsing to a document parser, you should also install lxml as the parsing library
- Python ships with a built-in parser (html.parser), but it is somewhat slower than lxml.
pip install bs4
pip install lxml
pip install html5lib
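To confirm the installation succeeded, a quick sanity check (a minimal sketch, assuming all three packages above installed cleanly) is to import each one and print its version:

```python
import bs4
import html5lib
from lxml import etree

# Each import succeeds only if the corresponding package is installed
print(bs4.__version__)
print(html5lib.__version__)
# lxml exposes its version as a tuple on the etree submodule
print(etree.LXML_VERSION)
```

If any of these imports raises ModuleNotFoundError, re-run the corresponding pip command above.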
II. Parsing Documents
- The second argument to BeautifulSoup specifies the parser used to parse the document
- The parser can be html.parser, lxml, or html5lib.
html = '''
<div class="modal-dialog">
  <div class="modal-content">
    <div class="modal-header">
      <button type="button" class="close" data-dismiss="modal">&times;</button>
      <h4 class="modal-title">Modal title</h4>
    </div>
    <div class="modal-body">
      ...
    </div>
    <div class="modal-footer">
      <a href="#" rel="external nofollow" class="btn btn-default" data-dismiss="modal">Close</a>
      <a href="#" rel="external nofollow" class="btn btn-primary">Save</a>
    </div>
  </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# prettify() formats the html/xml document for output
print(soup.prettify())
III. Parsing External Documents
- To parse an external document, open the file and pass the file object to BeautifulSoup
from bs4 import BeautifulSoup

with open('html_doc.html', encoding='utf8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
IV. Tag selector
- Tags (Tag) are the basic elements that make up an HTML document.
- The desired content can be extracted by tag name and tag attributes
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="name nickname user"><b>i am autofelix</b></p>', 'html.parser')

# Get the html code for the entire p tag
print(soup.p)
# Get the b tag
print(soup.p.b)
# Get the content of the p tag via string, text or get_text() (a NavigableString)
print(soup.p.text)
# attrs returns a dictionary of the tag's attributes and their values
print(soup.p.attrs)
# View the type of the returned data
print(type(soup.p))
# Get an attribute's value by name; multi-valued attributes such as class return a list
print(soup.p['class'])
# Assign a value to the class attribute; the list is converted back to a string in the output
soup.p['class'] = ['Web', 'Site']
print(soup.p)
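Indexing a tag like a dictionary (tag['id']) raises a KeyError when the attribute is missing. As a small sketch, the get() method is the safer lookup, returning None or a supplied default instead:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="name"><b>i am autofelix</b></p>', 'html.parser')

# class is multi-valued, so get() returns a list
print(soup.p.get('class'))        # ['name']
# The p tag has no id attribute: get() returns None rather than raising
print(soup.p.get('id'))           # None
# A default can be supplied for missing attributes
print(soup.p.get('id', 'no-id'))  # no-id
```

This mirrors dict.get() in plain Python, which is why it reads naturally when scraping pages whose tags may or may not carry a given attribute.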
V. CSS selectors
- The select() method supports most CSS selectors, including tag selectors, class selectors, id selectors, and hierarchical (combinator) selectors.
- Passing a selector to select() returns the matching content from the HTML document.
html = """
<html>
<head>
  <title>Learn Programming from Scratch</title>
</head>
<body>
  <p class="intro"><b>i am autofelix</b></p>
  <p class="nickname">Flying Rabbit</p>
  <a id="csdn" href="" rel="external nofollow">csdn homepage</a>
  <a href="/u/autofelix/publish" rel="external nofollow">infoq homepage</a>
  <a href="https://blog./autofelix" rel="external nofollow">51cto homepage</a>
  <p class="attention">Please follow, like, and share</p>
  <p class="introduce">
    <a id="cnblogs" href="/autofelix" rel="external nofollow">cnblogs homepage</a>
  </p>
</body>
</html>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

# Find by element tag
print(soup.select('p'))
# Find by attribute selector
print(soup.select('a[href]'))
# Find by class
print(soup.select('.attention'))
# Descendant node lookup
print(soup.select('html head title'))
# Find adjacent sibling nodes
print(soup.select('p + a'))
# Select the sibling node of a p tag by id
print(soup.select('p ~ #csdn'))
# nth-of-type(n) matches the nth sibling element of the same type
print(soup.select('p ~ a:nth-of-type(1)'))
# Find direct child nodes
print(soup.select('p > a'))
print(soup.select('.introduce > #cnblogs'))
VI. Node traversal
- You can use contents, children to iterate through the child nodes.
- You can use parent and parents to iterate through the parent nodes.
- You can use next_sibling and previous_sibling to traverse sibling nodes.
html = """
<html>
<head>
  <title>Learn Programming from Scratch</title>
</head>
<body>
  <p class="intro"><b>i am autofelix</b></p>
  <p class="nickname">Flying Rabbit</p>
  <a href="" rel="external nofollow">csdn homepage</a>
  <a href="/u/autofelix/publish" rel="external nofollow">infoq homepage</a>
  <a href="https://blog./autofelix" rel="external nofollow">51cto homepage</a>
  <p class="attention">Please follow, like, and share</p>
  <p class="introduce">
    <a href="/autofelix" rel="external nofollow">cnblogs homepage</a>
  </p>
</body>
</html>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
body_tag = soup.body
print(body_tag)

# contents returns all child nodes as a list
print(body_tag.contents)

# children is used to iterate over the child nodes
for child in body_tag.children:
    print(child)
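The bullets above also mention parent and sibling traversal; a minimal sketch of parent, parents, and previous_sibling/next_sibling on a small hypothetical document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p id="a">one</p><p id="b">two</p></div>', 'html.parser')
tag = soup.find('p', id='b')

# parent returns the immediate parent tag
print(tag.parent.name)                # div
# parents walks every ancestor up to the document itself
print([p.name for p in tag.parents])  # ['div', '[document]']
# previous_sibling / next_sibling move along the same level
print(tag.previous_sibling)           # <p id="a">one</p>
```

Note that with real, pretty-printed HTML the whitespace between tags is itself a text node, so previous_sibling may return a newline string rather than a tag; previous_element/next_element and find_previous_sibling() behave more predictably there.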
VII. The find_all method
- find_all() is one of the most commonly used methods for parsing HTML documents.
- It searches all descendants of the current tag,
- checks each node against the filter criteria,
- and returns the matching content as a list.
html = """
<html>
<head>
  <title>Learn Programming from Scratch</title>
</head>
<body>
  <p class="intro"><b>i am autofelix</b></p>
  <p class="nickname">Flying Rabbit</p>
  <a id="csdn" href="" rel="external nofollow">csdn homepage</a>
  <a href="/u/autofelix/publish" rel="external nofollow">infoq homepage</a>
  <a href="https://blog./autofelix" rel="external nofollow">51cto homepage</a>
  <p class="attention">Please follow, like, and share</p>
  <p class="introduce">
    <a id="cnblogs" href="/autofelix" rel="external nofollow">cnblogs homepage</a>
  </p>
</body>
</html>
"""

import re
from bs4 import BeautifulSoup

# Create a soup parsing object
soup = BeautifulSoup(html, 'html.parser')

# Find all a tags
print(soup.find_all("a"))
# limit=2 returns only the first two a tags found
print(soup.find_all("a", limit=2))
# Find by tag attribute and attribute value
print(soup.find_all("p", class_="nickname"))
# With no arguments, every tag in the document is returned
print(soup.find_all())
# Find tags by a list of tag names
print(soup.find_all(['b', 'a']))
# Regular expression matching on the id attribute value
print(soup.find_all('a', id=re.compile(r'.\d')))
# True matches any value: find all tags that have an id attribute
print(soup.find_all(id=True))
# Passing True matches every tag; print each tag's name
for tag in soup.find_all(True):
    print(tag.name, end=" ")
# Output all tags whose names start with b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# Long form
soup.find_all("a")
# Simplified form: calling the soup object directly is equivalent to find_all
soup("a")
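Besides tag names, attributes, lists, and regular expressions, find_all() also accepts a function that receives each tag and returns True for the ones to keep. A short sketch against a small hypothetical snippet:

```python
from bs4 import BeautifulSoup

html = '<p class="intro"><b>bold</b></p><a href="/u">link</a><p>plain</p>'
soup = BeautifulSoup(html, 'html.parser')

# Keep only tags that define a class attribute but no href
def has_class_no_href(tag):
    return tag.has_attr('class') and not tag.has_attr('href')

print(soup.find_all(has_class_no_href))  # [<p class="intro"><b>bold</b></p>]
```

A function filter is useful when the condition combines several attributes and cannot be expressed as a single name or attribute match.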
VIII. The find method
html = """
<html>
<head>
  <title>Learn Programming from Scratch</title>
</head>
<body>
  <p class="intro"><b>i am autofelix</b></p>
  <p class="nickname">Flying Rabbit</p>
  <a href="" rel="external nofollow">csdn homepage</a>
  <a href="/u/autofelix/publish" rel="external nofollow">infoq homepage</a>
  <a href="https://blog./autofelix" rel="external nofollow">51cto homepage</a>
  <p class="attention">Please follow, like, and share</p>
  <p class="introduce">
    <a href="/autofelix" rel="external nofollow">cnblogs homepage</a>
  </p>
</body>
</html>
"""

import re
from bs4 import BeautifulSoup

# Create a soup parsing object
soup = BeautifulSoup(html, 'html.parser')

# Find the first a tag and return it directly
print(soup.find('a'))
# Find the p tag with class intro
print(soup.find('p', class_='intro'))
# Match the a tag with the specified href attribute
print(soup.find('a', href=''))
# Regular matching based on attribute values
print(soup.find(class_=re.compile('tro')))
# The attrs parameter takes a dictionary of attributes and values
print(soup.find(attrs={'class': 'introduce'}))
# find returns None when no tag matches, while find_all returns an empty list
print(soup.find('aa'))
print(soup.find_all('bb'))
# Simplified form
print(soup.head.title)
# The above code is equivalent to
print(soup.find("head").find("title"))
This concludes this article on parsing web pages with Python and BeautifulSoup. For more content on BeautifulSoup, please search my earlier articles or continue browsing the related articles below, and I hope you will keep supporting me!