This article shares some Python crawler learning notes on usage of the BeautifulSoup module, for your reference.
Related content:
- What is BeautifulSoup
- Use of bs4
- Importing the module
- Choosing a parser
- Finding with tag names
- Finding with find / find_all
- Finding with select
First published: 2018-03-02 00:10
What is BeautifulSoup?
- Beautiful Soup is a Python library for extracting data from HTML and XML files. It lets you navigate, search, and modify the parse tree in idiomatic ways, using your favorite parser. (From the official documentation.)
- In practice, BeautifulSoup is a parser wrapper that extracts content for you, saving you the trouble of writing regular expressions.
Beautiful Soup 3 is no longer under development; Beautiful Soup 4 is recommended for current projects.
Version: the current major version is 4, imported as bs4.
The use of bs4:
1. Import the module:
from bs4 import BeautifulSoup
2. Choose a parser and parse the given content:
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, parser)
Common parsers: html.parser, lxml, xml, html5lib.
Some parsers need to be installed first, e.g. pip3 install lxml
By default BeautifulSoup uses Python's standard html.parser, but it also supports several third-party parsers such as lxml and html5lib.
Differences between parsers (taken from the official documentation):
Beautiful Soup provides the same interface for all parsers, but the parsers themselves differ, so the same document can produce differently structured trees under different parsers. The biggest difference is between the HTML parsers and the XML parser. Look at the following snippet, parsed as HTML:
BeautifulSoup("<a><b /></a>")
# <html><head></head><body><a><b></b></a></body></html>
Because the empty tag <b /> does not conform to the HTML standard, the parser converts it into a <b></b> pair.
The same document is parsed using XML as follows (parsing XML requires the lxml library). Note that the empty tag <b /> is still retained and the document is prefixed with an XML header instead of being contained within the <html> tag.
BeautifulSoup("<a><b /></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>
There are also differences among the HTML parsers. If the document being parsed is well-formed, the parsers differ only in speed, and all of them return the correct document tree.
However, if the document being parsed is not in a standardized format, then different parsers may return different results. In the following example, parsing an incorrectly formatted document using lxml results in the </p> tags being simply ignored.
BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>
Parsing the same document with the html5lib library gives a different result:
BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>
Instead of ignoring the stray </p> tag, html5lib completes it into a paired tag and also adds a <head> tag to the document tree.
The result of parsing with Python's built-in parser is as follows:
BeautifulSoup("<a></p>", "html.parser")
# <a></a>
Like lxml, the built-in parser ignores the </p> tag. Unlike html5lib, it does not try to create a standards-compliant document by wrapping the fragment in a <body> tag, and unlike lxml it does not even add an <html> tag.
Since the fragment "<a></p>" is malformed, all of the above results can be considered "correct". The html5lib library implements part of the HTML5 standard, so it comes closest, but every one of these parse trees is valid.
Because the choice of parser affects the result, it is best to tell BeautifulSoup explicitly which parser to use, to minimize unnecessary headaches.
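A minimal sketch of this advice: naming the parser explicitly makes the result reproducible on every machine. The example reuses the malformed fragment from above with html.parser, which is built into Python's standard library, so nothing extra needs installing.

```python
from bs4 import BeautifulSoup

# Explicitly naming the parser pins down which tree you get,
# regardless of which third-party parsers happen to be installed.
soup = BeautifulSoup("<a></p>", "html.parser")
print(soup)  # <a></a> -- the stray </p> is ignored, as described above
```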
3. Operations [by convention, soup below is the parsed object returned by BeautifulSoup(content, parser)]:
- Finding with tag names:
- Get a node by its tag name:
- soup.tagname
- Get a node's tag name [this focuses on .name; it is mainly used to recover the tag name of a result found by some non-tag-name filter]:
- soup.tagname.name
- Get a node's attributes by tag name:
- soup.tagname.attrs [gets all attributes, as a dictionary]
- soup.tagname.attrs[attribute name] [gets the specified attribute]
- soup.tagname[attribute name] [gets the specified attribute]
- soup.tagname.get(attribute name)
- Get a node's text content by tag name:
- soup.tagname.text
- soup.tagname.string
- soup.tagname.get_text()
Supplement 1: the filters above can be nested, e.g.:
print(soup.p.b)  # the b tag inside the first p tag
Supplement 2: the name, text, string, and attrs accessors above can be used on any result that is a Tag object.
```python
from bs4 import BeautifulSoup

# note: id="i1" restored on the first p tag so the find(id='i1')
# lookup below has a target
html = """
<html>
<head>
<meta charset="UTF-8">
<title>this is a title</title>
</head>
<body>
<p class="news" id="i1">123</p>
<p class="contents">456</p>
<a href="" rel="external nofollow">advertisements</a>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')

print("Get node".center(50, '-'))
print(soup.head)  # gets the head tag
print(soup.p)     # returns only the first p tag

print("Get node name".center(50, '-'))
print(soup.title.name)
print(soup.find(id='i1').name)

print("Get text content".center(50, '-'))
print(soup.title.text)     # returns the content of the title
print(soup.title.string)   # returns the content of the title
print(soup.title.get_text())

print("----- Get attributes -----")
print(soup.p.attrs)           # returns the tag's attributes as a dictionary
print(soup.p.attrs['class'])  # returns the attribute's values as a list
print(soup.p['class'])        # same, via indexing
print(soup.p.get('class'))    # same, via get()

t = soup.title
print(type(t))  # <class 'bs4.element.Tag'>
print(t.name)   # title

# Nested selection:
print(soup.head.title.string)
```
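One subtlety of the text accessors worth noting: .string returns the text only when a tag has a single child; with mixed children it returns None, while .text (and .get_text()) always concatenates all descendant text. A minimal sketch with a hypothetical fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one <b>two</b></p><p>solo</p>", "html.parser")
first, second = soup.find_all("p")
print(first.string)   # None -- <p> has two children: 'one ' and <b>
print(first.text)     # one two -- all descendant text concatenated
print(second.string)  # solo -- a single string child, so .string works
```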
- Get child nodes [direct access also yields the '\n' whitespace between tags, which is treated as a node]:
- soup.tagname.contents [return value is a list]
- soup.tagname.children [return value is an iterable object; iterate over it to get the actual child nodes]
- Get descendant nodes:
- soup.tagname.descendants [return value is also an iterable object; iterate to get the actual nodes]
- Get the parent node:
- soup.tagname.parent
- Get ancestor nodes [parent, grandparent, great-grandparent, ...]:
- soup.tagname.parents [return value is an iterable object]
- Get sibling nodes:
- soup.tagname.next_sibling [gets the sibling node immediately after it]
- soup.tagname.next_siblings [gets all sibling nodes after it; return value is an iterable object]
- soup.tagname.previous_sibling [gets the sibling node immediately before it]
- soup.tagname.previous_siblings [gets all sibling nodes before it; return value is an iterable object]
Supplement 3: as in Supplement 2, the accessors above can all be used on any result that is a Tag object.
```python
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<p class="news"><a>123456</a>
<a>78910</a>
</p><p class="contents"></p>
<a href="" rel="external nofollow">advertisements</a>
<span>aspan</span>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')

print("Get child nodes".center(50, '-'))
print(soup.p.contents)
c = soup.p.children  # the return value is an iterable object
for i, child in enumerate(c):
    print(i, child)

print("Get descendant nodes".center(50, '-'))
c2 = soup.p.descendants
for i, child in enumerate(c2):
    print(i, child)

print("Get parent node".center(50, '-'))
c3 = soup.title.parent
print(c3)

print("Get ancestor nodes".center(50, '-'))
c4 = soup.title.parents  # parent, grandparent, ... up to the document
for i, child in enumerate(c4):
    print(i, child.name)

print("Get sibling nodes".center(50, '-'))
print(soup.p.next_sibling)
print(soup.p.previous_sibling)
for i, child in enumerate(soup.p.next_siblings):
    print(i, child, end='\t')
for i, child in enumerate(soup.p.previous_siblings):
    print(i, child, end='\t')
```
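As the list above warns, the whitespace between tags counts as a node, so next_sibling is often a bare '\n' string rather than the next tag. When you want the next tag sibling directly, bs4's find_next_sibling() skips the text nodes. A small sketch with a hypothetical fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>one</p>\n<p>two</p></body>", "html.parser")
print(repr(soup.p.next_sibling))   # '\n' -- the whitespace text node
print(soup.p.find_next_sibling())  # <p>two</p> -- skips the text node
```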
- Finding with find / find_all:
- find(name, attrs, recursive, text, **kwargs) [finds matching tags according to the arguments, but returns only the first match]
- find_all(name, attrs, recursive, text, **kwargs) [finds matching tags according to the arguments and returns all matches, as a list]
- Introduction to the filter parameters:
- name: the tag name; filters tags by name
- attrs: attributes; filters tags by attribute key-value pairs. Values can be passed as attribute_name=value or attrs={attribute_name: value} [but since class is a Python keyword, use class_ for the keyword form]
- text: text content; filters by the given text. [Used alone, text returns only the text itself, so it is generally combined with other conditions]
- recursive: whether the search recurses. When False, only direct children are searched, not deeper descendants.
- Each result node is a Tag object, so attributes, text content, tag name, etc. can be obtained with the same accessors described under "Finding with tag names" above.
```python
from bs4 import BeautifulSoup

# note: id="i1" restored on the first a tag so the attrs lookup
# below has a target
html = """
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<p class="news"><a id="i1">123456</a>
<a id="i2">78910</a>
</p><p class="contents"></p>
<a href="" rel="external nofollow">advertisements</a>
<span>aspan</span>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
print("---------------------")
print(soup.find_all('a'), end='\n\n')
print(soup.find_all('a')[0])
print(soup.find_all(attrs={'id': 'i1'}), end='\n\n')
print(soup.find_all(class_='news'), end='\n\n')
print(soup.find_all('a', text='123456'))
print(soup.find_all(id='i2', recursive=False), end='\n\n')  # [] -- soup's only direct child is the html tag
a = soup.find_all('a')
print(a[0].name)
print(a[0].text)
print(a[0].attrs)
```
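The recursive parameter is easiest to see on a tag partway down the tree. In this sketch (a hypothetical fragment, not the example above), the default search descends into every level, while recursive=False inspects only the direct children of the tag it is called on:

```python
from bs4 import BeautifulSoup

html = '<body><p><a href="#">nested</a></p><a href="#">direct</a></body>'
soup = BeautifulSoup(html, "html.parser")
print(len(soup.body.find_all("a")))                   # 2 -- both links found
print(len(soup.body.find_all("a", recursive=False)))  # 1 -- only body's direct child
```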
- Filtering with select [select uses CSS selector rules]:
- soup.select('tagname') filters out the specified tags by tag name; the return value is a list
- In CSS, #xxx selects by id; soup.select('#xxx') filters out tags with that id; the return value is a list
- In CSS, .xxx selects by class; soup.select('.xxx') filters out tags with that class; the return value is a list
- Selectors can be nested: soup.select("#xxx .xxxx"), e.g. soup.select("#id2 .news") selects tags with class="news" inside the tag with id="id2"; the return value is a list
- Each result node is a Tag object, so attributes, text content, tag name, etc. can be obtained with the same accessors described under "Finding with tag names" above.
```python
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<p class="news"><a>123456</a>
<a id="i2">78910</a>
</p><p class="contents"></p>
<a href="" rel="external nofollow">advertisements</a>
<span class="span1" id="i4">aspan</span>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
sp1 = soup.select('span')  # the return value is a list whose elements are bs4 Tag objects
print(soup.select("#i2"), end='\n\n')
print(soup.select(".news"), end='\n\n')
print(soup.select(".news #i2"), end='\n\n')
print(type(sp1), type(sp1[0]))
print(sp1[0].name)   # the list elements are Tag objects
print(sp1[0].attrs)
print(sp1[0]['class'])
```
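select also accepts the other common CSS selector forms (tag.class combinations, attribute selectors), and select_one returns the first matching Tag directly instead of a one-element list. A small sketch using a hypothetical fragment:

```python
from bs4 import BeautifulSoup

html = '<p class="news"><a id="i1" href="#">one</a></p><a href="#">two</a>'
soup = BeautifulSoup(html, "html.parser")
print([a.text for a in soup.select("p.news a")])  # ['one'] -- descendant combinator
print([a.text for a in soup.select("a[id]")])     # ['one'] -- attribute selector
print(soup.select_one("a").text)                  # one -- first match, as a Tag
```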
Supplement 4:
When the markup is incomplete, the parser auto-completes the missing tags while building the tree, and soup.prettify() shows the completed document. Relying on this is generally recommended over hand-patching uneven markup.
```python
from bs4 import BeautifulSoup

# The html string below is missing </span> and </body> at the end;
# the parser completes them while building the tree.
html = """
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<p class="news"><a>123456</a>
<a id="i2">78910</a>
</p><p class="contents"></p>
<a href="" rel="external nofollow">advertisements</a>
<span class="span1" id="i4">aspan
</html>
"""
soup = BeautifulSoup(html, 'lxml')
c = soup.prettify()
print(c)
```
For a more detailed introduction, refer to the official documentation, which is also available in Simplified Chinese:
/software/BeautifulSoup/bs4/doc/
I hope this article helps you with your Python programming.