Introduction to the Python crawler module Beautiful Soup
Simply put, Beautiful Soup is a Python library whose primary purpose is to extract data from web pages. The official description is as follows: Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that extracts the data users need by parsing documents; because it is simple, a complete application can be written with very little code. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8 encoding. You don't need to think about encodings unless the document doesn't specify one and Beautiful Soup cannot detect it automatically; in that case you just specify the original encoding and you're done. Beautiful Soup works with excellent parsers such as lxml and html5lib, giving users the flexibility to choose different parsing strategies or to trade flexibility for speed.
Installing the Python crawler module Beautiful Soup
Beautiful Soup 3 is no longer developed; Beautiful Soup 4 (BS4 for short) is recommended for current projects. The code has been ported into the bs4 package, which means we need to import bs4 when using it. The version used here is Beautiful Soup 4.3.2. Note that Beautiful Soup 4 runs on both Python 2 and Python 3 (Beautiful Soup 3 supports Python 2 only); I am using Python 2.7.7 here. You can install BS4 using pip or easy_install, either of which will work.
easy_install beautifulsoup4
pip install beautifulsoup4
If you want to install the latest version, you can also download the source package and install it manually, which is just as convenient. After downloading, unzip the package and run the following command to complete the installation:
sudo python setup.py install
Then you need to install lxml
easy_install lxml
pip install lxml
Another alternative parser is html5lib, a pure-Python implementation. html5lib parses pages the same way a browser does, and can be installed with either of the following commands.
easy_install html5lib
pip install html5lib
Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, Python's default parser is used. The lxml parser is more powerful and faster, so installing it is recommended.
Parser | Typical usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Python's built-in standard library; moderate speed; reasonable fault tolerance | Poor fault tolerance in Python versions before 2.7.3 and 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; good fault tolerance | Requires the lxml C library to be installed
lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | Very fast; the only parser that supports XML | Requires the lxml C library to be installed
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; produces valid HTML5 | Slow
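The table above suggests preferring lxml when it is available. A minimal sketch of that choice (Python 3 syntax; the markup string here is illustrative):

```python
from bs4 import BeautifulSoup

markup = "<html><head><title>Hello</title></head><body><p>Hi</p></body></html>"

# Prefer lxml when it is installed; fall back to the bundled html.parser otherwise.
try:
    import lxml  # noqa: F401
    parser = "lxml"
except ImportError:
    parser = "html.parser"

soup = BeautifulSoup(markup, parser)
print(soup.title.string)  # -> Hello
```

Passing the parser name explicitly also silences the "no parser was explicitly specified" warning that newer BS4 versions emit.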
Creating a Beautiful Soup object
First you must import the bs4 library
from bs4 import BeautifulSoup
We create a string, which we'll use in the examples that follow.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="/lacie" class="sister" id="link2">Lacie</a> and
<a href="/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Create beautifulsoup object
soup = BeautifulSoup(html)
Alternatively, we can create the object from a local HTML file, for example
soup = BeautifulSoup(open('index.html'))  # 'index.html' is a placeholder local file name
The code above opens the local file and uses it to create the soup object. Let's print out the contents of the soup object, formatted nicely:
print soup.prettify()
Specifying the encoding: when the HTML uses an encoding other than UTF-8 or ASCII, such as GB2312, you need to specify the corresponding character encoding so that BeautifulSoup can parse it correctly.
htmlCharset = "GB2312"
soup = BeautifulSoup(respHtml, from_encoding=htmlCharset)
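A quick sketch of the idea (Python 3 syntax; the response body here is fabricated for illustration): `from_encoding` tells Beautiful Soup the source encoding explicitly, and the parsed tree is Unicode regardless of the input bytes.

```python
from bs4 import BeautifulSoup

# Hypothetical response body: a page encoded in GB2312, received as raw bytes.
resp_html = u"<html><body><p>\u4e2d\u6587</p></body></html>".encode("gb2312")

# from_encoding overrides automatic detection; the tree itself is Unicode.
soup = BeautifulSoup(resp_html, "html.parser", from_encoding="gb2312")
print(soup.p.string)
```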
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
import re

# String to be parsed
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title aq"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="/elsie" class="sister" id="link1">Elsie</a>,
<a href="/lacie" class="sister" id="link2">Lacie</a> and
<a href="/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object from the HTML string
soup = BeautifulSoup(html_doc, from_encoding='utf-8')

# Output the first title tag
print soup.title
# Output the tag name of the first title tag
print soup.title.name
# Output the contents of the first title tag
print soup.title.string
# Output the tag name of the parent of the first title tag
print soup.title.parent.name
# Output the first p tag
print soup.p
# Output the class attribute of the first p tag
print soup.p['class']
# Output the href attribute of the first a tag
print soup.a['href']
'''
A tag's attributes can be added, deleted, or modified.
They operate like a dictionary.
'''
# Change the href attribute of the first a tag to /
soup.a['href'] = '/'
# Add a name attribute to the first a tag
soup.a['name'] = u'Baidu'
# Remove the class attribute of the first a tag
del soup.a['class']
# Output all child nodes of the first p tag
print soup.p.contents
# Output the first a tag
print soup.a
# Output all a tags, as a list
print soup.find_all('a')
# Output the first a tag whose id attribute equals link3
print soup.find(id='link3')
# Get all text content
print(soup.get_text())
# Output all attributes of the first a tag
print soup.a.attrs
# Get the content of the href attribute of every link
for link in soup.find_all('a'):
    print(link.get('href'))
# Loop over the child nodes of the first p tag
for child in soup.p.children:
    print(child)
# Regular-expression match: tags whose name contains b
for tag in soup.find_all(re.compile("b")):
    print(tag.name)
import bs4  # Import the BeautifulSoup library
soup = BeautifulSoup(html)  # html can be a string or a file handle
Note that BeautifulSoup automatically detects the encoding format of the incoming document and converts it to Unicode. With the two lines above, BS automatically builds the document into a parse tree as shown above.
Beautiful Soup's four object categories
Beautiful Soup converts a complex HTML document into a complex tree structure in which each node is a Python object. All objects can be categorized into four types:
- Tag
- NavigableString
- BeautifulSoup
- Comment
(1)Tag
What is a Tag? In layman's terms, it's a tag in HTML, such as
<title>The Dormouse's story</title>
<a class="sister" href="///" >jb51</a>
Tags such as title and a above, together with the content they enclose, are Tag objects. Below we get a feel for how easily Beautiful Soup fetches Tags; in each snippet, the comment shows the result of running the code.
print soup.title
#<title>The Dormouse's story</title>
print soup.head
#<head><title>The Dormouse's story</title></head>
print soup.a
#<a class="sister" href="///" ><!-- Elsie --></a>
print soup.p
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
Using soup plus the tag name conveniently fetches the content of these tags; doesn't that feel much easier than regular expressions? One caveat: this finds only the first tag in the whole document that meets the requirement. To query all matching tags, use the find_all function, introduced later; find_all returns a sequence you can loop over to fetch each match in turn. We can verify the types of these objects:
print type(soup.a)
#<class 'bs4.element.Tag'>
For Tag, it has two important attributes, name and attrs.
name
print soup.name
#[document]
print soup.head.name
#head
The soup object itself is special: its name is [document]. For other internal tags, the output is the name of the tag itself.
attrs
print soup.p.attrs
#{'class': ['title'], 'name': 'dromouse'}
Here we printed all the attributes of the p tag; what we get is a dictionary. To fetch a single attribute, do the following; for example, to get its class:
print soup.p['class']
#['title']
You can also use the get method, passing in the attribute name; the two forms are equivalent.
print soup.p.get('class')
#['title']
We can make changes to these attributes and contents, etc., for example
soup.p['class'] = "newClass"
print soup.p
#<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
It is also possible to remove this attribute, for example
del soup.p['class']
print soup.p
#<p name="dromouse"><b>The Dormouse's story</b></p>
However, modifying and deleting attributes is not our main use case, so we won't go into detail here; if you need it, please consult the official documentation mentioned earlier.
head = soup.find('head')
#head = soup.head
#head = soup.contents[0].contents[0]
print head
html = soup.contents[0]   # <html> ... </html>
head = html.contents[0]   # <head> ... </head>
body = html.contents[1]   # <body> ... </body>
A tag's attributes can be returned as a dictionary by accessing .attrs. You can also access a specific attribute value directly; a multi-valued attribute (such as class) is returned as a list.
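A small sketch of the attribute-access behaviors described above (Python 3 syntax; the markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="title main" name="dromouse"><b>text</b></p>', "html.parser")
tag = soup.p

print(tag.attrs)        # the whole attribute dictionary
print(tag['class'])     # class is multi-valued, so a list comes back
print(tag.get('name'))  # single-valued attribute: a plain string
print(tag.get('id'))    # missing attribute: .get returns None instead of raising
```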
(2)NavigableString
Now that we have the tag's content, the question arises: how do we get the text inside the tag? It's easy, just use .string, for example
print soup.p.string
#The Dormouse's story
This way we can easily fetch the contents of a tag; just think how much more trouble it would be with regular expressions. Its type is NavigableString, which translates as a string that can be navigated. Let's check its type:
print type(soup.p.string)
#<class 'bs4.element.NavigableString'>
(3)BeautifulSoup
The BeautifulSoup object represents the entire content of a document. Most of the time you can treat it as a Tag object; it is a special kind of Tag. We can fetch its type, name, and attributes to get a feel for it.
print type(soup.name)
#<type 'unicode'>
print soup.name
# [document]
print soup.attrs
#{} empty dictionary
(4)Comment
The Comment object is a special type of NavigableString. Its output does not include the comment markers, which, if not handled properly, may cause unexpected trouble for our text processing. Let's look at a tag with a comment:
print soup.a
print soup.a.string
print type(soup.a.string)
The results of the run are as follows
<a class="sister" href="///" ><!-- Elsie --></a>
Elsie
<class 'bs4.element.Comment'>
The content of the a tag is actually a comment, but when we use .string to output it, we find the comment markers have been removed, which may cause us unnecessary trouble. Printing its type shows it is a Comment, so we had better check the type before using the value. The check looks like this:
if type(soup.a.string) == bs4.element.Comment:
    print soup.a.string
In the code above, we first determine whether the type is Comment, and only then perform other operations, such as printing it out.
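The same check, sketched as a runnable example (Python 3 syntax, importing Comment from bs4.element; the markup is illustrative):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    '<a class="sister" href="/elsie"><!-- Elsie --></a>', "html.parser")

# .string silently drops the comment markers, so check the type first.
content = soup.a.string
if type(content) == Comment:
    print("comment:", content.strip())
else:
    print("text:", content)
```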
Beautiful Soup module traverses the document tree
(1) Direct child nodes
Knowledge points: the .contents and .children properties.
- tag.tag_child: a child tag can be accessed directly by its name.
- .contents: returns all child nodes as a list; individual children can be fetched by [num] indexing.
- .children: a generator, usable in a loop: for child in tag.children.
Use .contents to walk down the tree and .parent to walk back up it.
print soup.head.contents
#[<title>The Dormouse's story</title>]
The output is a list, and we can use the list index to get one of its elements
print soup.head.contents[0]
#<title>The Dormouse's story</title>
.children does not return a list, but we can get all the children by iterating over it. Printing .children shows a list iterator object; it can be converted with list(), or, of course, traversed with a for statement.
print soup.head.children
#<listiterator object at 0x7f71457f5710>
How do we get the contents? It's easy, just iterate over it, the code and the result is as follows
for child in soup.body.children:
    print child
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="/elsie" ><!-- Elsie --></a>,
<a class="sister" href="/lacie" >Lacie</a> and
<a class="sister" href="/tillie" >Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
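A compact check of the difference between .contents and .children (Python 3 syntax; the markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser")

print(soup.head.contents)     # a real list of direct children
print(soup.head.contents[0])  # so it can be indexed

# .children is a generator over the same direct children
for child in soup.head.children:
    print(child.name)
```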
(2) All descendant nodes
Knowledge point: the .descendants property. The .contents and .children properties contain only the direct children of a tag; the .descendants property recursively iterates over all of a tag's descendants. As with .children, it is a generator that we have to loop over: for des in tag.descendants.
for child in soup.descendants:
    print child
The result is as follows. You can see that all the nodes are printed out: first the outermost html tag, then the contents are peeled off starting from the head tag one layer at a time, and so on.
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="/elsie" ><!-- Elsie --></a>,
<a class="sister" href="/lacie" >Lacie</a> and
<a class="sister" href="/tillie" >Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; ...</p>
<p class="story">...</p>
</body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story
<p class="story">Once upon a time there were three little sisters; and their names were ...</p>
Once upon a time there were three little sisters; and their names were
<a class="sister" href="/elsie" ><!-- Elsie --></a>
 Elsie 
,
<a class="sister" href="/lacie" >Lacie</a>
Lacie
 and
<a class="sister" href="/tillie" >Tillie</a>
Tillie
; and they lived at the bottom of a well.
<p class="story">...</p>
...
(3) Node content
Knowledge point: the .string property. If a tag has only one NavigableString child, .string returns that child; otherwise it may return None. If a tag has only one child tag, .string also works: it returns the same result as calling .string on that unique child. In plain terms: if there are no more tags inside a tag, .string returns the tag's text; if there is exactly one tag inside, .string returns the innermost content; if there is more than one child node, it returns None. For example:
print soup.head.string
#The Dormouse's story
print soup.title.string
#The Dormouse's story
If a tag contains more than one child node, the tag cannot determine which child .string should refer to, so the output of .string is None:
print soup.html.string
# None
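The three cases above can be sketched in one runnable example (Python 3 syntax; the markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>one</p><p>two</p></body></html>", "html.parser")

print(soup.title.string)  # only child is a string, so it is returned
print(soup.head.string)   # only child is <title>, so its .string is passed up
print(soup.body.string)   # several children: ambiguous, so None
```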
(4) Multiple contents
Knowledge point: the .strings and .stripped_strings properties. .strings fetches multiple pieces of text, but you need to iterate over it to get them, as in the following example:
for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'
The output of .strings may contain a lot of whitespace or blank lines; use .stripped_strings instead to remove the extra whitespace.
for string in soup.stripped_strings:
    print(repr(string))
# u"The Dormouse's story"
# u"The Dormouse's story"
# u'Once upon a time there were three little sisters; and their names were'
# u'Elsie'
# u','
# u'Lacie'
# u'and'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'...'
(5) Parent node
Knowledge point: the .parent property. Use .parent to get a node's parent; the related .parents property walks from the parent all the way up to the root.
body = soup.body
html = body.parent  # html is the parent of body
p = soup.p
print p.parent.name
#body
content = soup.head.title.string
print content.parent.name
#title
(6) All parent nodes
Knowledge point: the .parents attribute. You can recursively fetch all ancestors of an element with the .parents attribute, for example:
content = soup.head.title.string
for parent in content.parents:
    print parent.name
# title
# head
# html
# [document]
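The parent chain can be verified end to end (Python 3 syntax; the markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser")

content = soup.head.title.string
# collect ancestor names from the innermost parent up to the document root
names = [parent.name for parent in content.parents]
print(names)
```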
(7) Sibling nodes
Knowledge: .next_sibling .previous_sibling properties
Use .next_sibling and .previous_sibling to get the following and preceding sibling nodes.
Tag.next_sibling
Tag.next_siblings
Tag.previous_sibling
Tag.previous_siblings
A sibling node is a node at the same level as the current node. The .next_sibling attribute gets the node's next sibling and .previous_sibling the previous one; if the sibling does not exist, None is returned.
Note: in a real document, the .next_sibling and .previous_sibling of a tag are usually strings or whitespace, since whitespace and newlines also count as nodes, so the result may be whitespace or a newline.
print soup.p.next_sibling
# The actual position here is blank (a whitespace node)
print soup.p.previous_sibling
# None; returns None when there is no previous sibling
print soup.p.next_sibling.next_sibling
#<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="/elsie" ><!-- Elsie --></a>,
#<a class="sister" href="/lacie" >Lacie</a> and
#<a class="sister" href="/tillie" >Tillie</a>;
#and they lived at the bottom of a well.</p>
# The sibling of the next node is the node we can actually see
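The whitespace-node behavior above can be demonstrated directly (Python 3 syntax; the markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body>\n<p class="title">first</p>\n<p class="story">second</p>\n</body>',
    "html.parser")
p = soup.p

print(repr(p.next_sibling))         # the newline text node between the two <p> tags
print(p.next_sibling.next_sibling)  # the real next tag, the second <p>
print(repr(p.previous_sibling))     # also a newline, not None
```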
The .next attribute: .next can only be applied to a single element; in effect it steps through the elements of a contents list one by one. For example:
If tag.contents[1] is u'HTML' and tag.contents[2] is u'\n', then tag.contents[1].next is equivalent to tag.contents[2].
head = body.previous_sibling  # head is at the same level as body, the sibling just before it
p1 = body.contents[0]  # p1 and p2 are children of body; contents[0] gets p1
p2 = p1.next_sibling   # p2 is at the same level as p1, the sibling just after it; body.contents[1] also works
contents[] can also be used to work out relationships between nodes; to find ancestors or descendants you can use findParent(s), findNextSibling(s) and findPreviousSibling(s).
(8) All sibling nodes
Knowledge: .next_siblings .previous_siblings properties The .next_siblings and .previous_siblings properties allow you to iterate through the siblings of the current node.
for sibling in soup.a.next_siblings:
    print(repr(sibling))
# u',\n'
# <a class="sister" href="/lacie" >Lacie</a>
# u' and\n'
# <a class="sister" href="/tillie" >Tillie</a>
# u'; and they lived at the bottom of a well.'
# None
(9) Front and back nodes
Knowledge point: the .next_element and .previous_element properties. These differ from .next_sibling and .previous_sibling in that they are not restricted to sibling nodes but move across all nodes, regardless of hierarchy. For example, the head node is
<head><title>The Dormouse's story</title></head>
Then its next element is the title tag; note that this step crosses hierarchy levels.
print soup.head.next_element
#<title>The Dormouse's story</title>
(10) All front and back nodes
Knowledge point: the .next_elements and .previous_elements properties. These iterators let you move forward or backward through the document's content, as if the document were being parsed.
last_a_tag = soup.find("a", id="link3")
for element in last_a_tag.next_elements:
    print(repr(element))
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# <p class="story">...</p>
# u'...'
# u'\n'
# None
The above is the basic usage of traversing the document tree.
Search the document tree
The most commonly used function is find_all().
(1) find_all( name , attrs , recursive , text , **kwargs )
The find_all() method searches all the tag children of the current tag and determines whether they meet the filter conditions.
(1) The name parameter
The name parameter finds all tags with the given name; string objects are automatically ignored.
# The first parameter is the name of the tag:
tag.find_all('title')  # returns a list, e.g. [<title>&%^&*</title>]
# The second parameter matches an attribute; class is a Python keyword, so write class_:
tag.find_all('title', class_='sister')  # e.g. [<title class="sister">%^*&</title>]
# The second parameter can also be a plain string, which matches the CSS class:
tag.find_all('title', 'sister')  # same as class_='sister'
A. Passing a string. The simplest filter is a string: pass a string argument and Beautiful Soup finds content that fully matches the string. The following example finds all the <b> tags in the document.
soup.find_all('b')
# [<b>The Dormouse's story</b>]
print soup.find_all('a')
# [<a class="sister" href="/elsie"><!-- Elsie --></a>,
#  <a class="sister" href="/lacie">Lacie</a>,
#  <a class="sister" href="/tillie">Tillie</a>]
B. Passing a regular expression. If you pass a regular expression, Beautiful Soup matches tag names with its match() method. The following example finds all tags whose names start with b, which means both the <body> and <b> tags will be found.
import re
for tag in soup.find_all(re.compile("^b")):
    print tag.name
# body
# b
C. Pass a list If you pass a list parameter, Beautiful Soup will return the content that matches any element in the list. The following code finds all <a> tags and <b> tags in the document
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="/elsie">Elsie</a>,
#  <a class="sister" href="/lacie">Lacie</a>,
#  <a class="sister" href="/tillie">Tillie</a>]
D. Passing True. True matches any value; the following code finds all the tags in the document, but none of the string nodes.
for tag in soup.find_all(True):
    print tag.name
# html
# head
# title
# body
# p
# b
# p
# a
# a
E. Passing a method. If no suitable filter exists, you can define a method that takes a single tag argument and returns True if the element matches, False otherwise. The following method returns True for tags that have a class attribute but no id attribute.
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
Passing this as an argument to the find_all() method will get all <p> tags.
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]
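The same filter can be exercised end to end; a minimal runnable sketch (the sample markup below is hypothetical, chosen so only the second tag qualifies):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the second <p> has a class but no id.
html = ('<p class="title" id="p1">one</p>'
        '<p class="story">two</p>'
        '<b>three</b>')
soup = BeautifulSoup(html, "html.parser")

def has_class_but_no_id(tag):
    # True only for tags with a class attribute and without an id attribute
    return tag.has_attr('class') and not tag.has_attr('id')

print([t.get_text() for t in soup.find_all(has_class_but_no_id)])   # ['two']
```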
(2) The keyword parameters. Note: if a named argument does not match one of the built-in parameter names, the search treats it as a filter on a tag attribute of that name. For example, an argument named id makes Beautiful Soup search the "id" attribute of every tag.
soup.find_all(id='link2')
# [<a class="sister" href="/lacie" >Lacie</a>]
If the href parameter is passed, Beautiful Soup will search for the "href" attribute of each tag.
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="/elsie" >Elsie</a>]
Multiple attributes of a tag can be filtered at the same time by using multiple parameters with the specified names.
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="/elsie" >three</a>]
Here we want to filter by class, but class is a Python keyword, so append an underscore and use class_.
soup.find_all("a", class_="sister")
# [<a class="sister" href="/elsie" >Elsie</a>,
# <a class="sister" href="/lacie" >Lacie</a>,
# <a class="sister" href="/tillie" >Tillie</a>]
Some tag attributes can't be used as keyword arguments in a search, such as the data-* attributes in HTML5.
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
However, you can still search for tags with such special attributes by passing a dictionary as the attrs parameter of the find_all() method.
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
(3) The text parameter searches the document for string content. Like the name parameter, text accepts a string, a regular expression, a list, or True.
soup.find_all(text="Elsie")
# [u'Elsie']
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']
soup.find_all(text=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]
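A runnable sketch of the text parameter (Python 3 syntax; the markup is hypothetical, and note that recent bs4 releases prefer the alias string= for the same parameter):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; text= accepts a string, a list, or a regex.
html = '<p>Elsie</p><p>Lacie</p><p>Once upon a time</p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all(text="Elsie"))               # ['Elsie']
print(soup.find_all(text=["Elsie", "Lacie"]))    # ['Elsie', 'Lacie']
print(soup.find_all(text=re.compile("upon")))    # ['Once upon a time']
```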
(4) The limit parameter. The find_all() method returns every match, so on a very large document tree the search can be slow. If you don't need all the results, use the limit parameter to cap the number returned; the effect is similar to the LIMIT keyword in SQL: the search stops as soon as the limit is reached. In the example below, three tags in the document tree match the criteria, but only two results are returned because of the limit.
soup.find_all("a", limit=2)
# [<a class="sister" href="/elsie" >Elsie</a>,
# <a class="sister" href="/lacie" >Lacie</a>]
(5) The recursive parameter. When you call a tag's find_all() method, Beautiful Soup searches all the descendants of the current tag. If you only want to search its direct children, pass recursive=False. Take this simple document:
<html>
 <head>
  <title>The Dormouse's story</title>
 </head>
...
Search results with or without recursive parameter.
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []
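The same contrast in a self-contained sketch (Python 3 syntax; the one-line document mirrors the example above):

```python
from bs4 import BeautifulSoup

# Minimal sketch of recursive=False, mirroring the document above.
html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.html.find_all("title"))                   # [<title>The Dormouse's story</title>]
# <title> is a grandchild of <html>, not a direct child, so this finds nothing:
print(soup.html.find_all("title", recursive=False))  # []
```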
(2)find(name=None, attrs={}, recursive=True, text=None, **kwargs)
The only difference from find_all() is that find_all() returns a list of all matching elements, while find() returns the first match directly (or None if nothing matches).
find('p') vs. findAll('p'): find returns a single result, the first matching tag pair found from the beginning of the document; if that tag pair contains a lot of nested content, everything inside it is included as well. findAll returns a list; when a matching tag contains further tags of the same name, those inner tags appear inside their parent in the list rather than again as separate entries. In other words, findAll returns all the results that meet the criteria, collected in a list.
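A minimal runnable sketch of the contrast (Python 3 syntax; the markup is hypothetical, and findAll is the older BS3-style alias that BS4 still accepts for find_all):

```python
from bs4 import BeautifulSoup

# Hypothetical markup contrasting find() with find_all().
html = '<p class="a">one</p><p class="b">two</p><p class="c">three</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")        # a single Tag: the first match
every = soup.find_all("p")    # a list of every matching Tag

print(first.get_text())       # one
print(len(every))             # 3
print(soup.find("div"))       # None  (no match -> None)
print(soup.find_all("div"))   # []    (no match -> empty list)
```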
find(onclick='...')  # Find tags by the value of their onclick attribute
find(attrs={'style': r'outline:none;'})  # Find tags that have style='outline:none;' among their attributes
tag search
find(tagname) # Directly search for a tag named tagname e.g.: find('head')
find(list) # Search for tags in a list, e.g.: find(['head', 'body'])
find(dict) # Search for tags in dict, e.g. :find({'head':True, 'body':True})
find(re.compile('^p'))  # Search for tags whose name matches a regular expression, e.g. names starting with p
find(lambda tag: len(tag.name) == 1)  # Search with a function that returns True for matching tags, e.g. tags whose name is one character long
find(True) # Search all tags
attrs search
find(id='xxx') # Find the id attribute of xxx
find(attrs={'id': re.compile('xxx'), 'align': 'xxx'})  # Find tags whose id attribute matches the regular expression and whose align attribute is xxx
find(attrs={'id': True, 'align': None})  # Find tags that have an id attribute but no align attribute
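These attrs forms can be checked with a short sketch (Python 3 syntax; the markup and attribute names are hypothetical, using title instead of align since html.parser keeps it verbatim):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup checking the regex / True / None behaviour of attrs.
html = ('<a id="link1" title="t">one</a>'
        '<a id="other">two</a>'
        '<a>three</a>')
soup = BeautifulSoup(html, "html.parser")

# id matches a regular expression:
print(soup.find(attrs={'id': re.compile('^link')}).get_text())   # one
# has an id attribute but no title attribute:
print([t.get_text() for t in soup.find_all(attrs={'id': True, 'title': None})])  # ['two']
```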
# match1, match2 and match3 are regular expressions compiled beforehand with re.compile
resp1 = soup.find_all('a', attrs={'href': match1})
resp2 = soup.find_all('h1', attrs={'class': match2})
resp3 = soup.find_all('img', attrs={'id': match3})
text search. Searching by text invalidates the other filters such as tag and attrs; the accepted values are the same as for tag searches.
# u'This is paragraph one.'
# u'This is paragraph two.'
# Note: 1. Each tag's text includes its own text and the text of all its descendants. 2. All text has been automatically converted to Unicode; if necessary, you can re-encode it yourself with encode(xxx).
Recursive and Limit Properties
recursive=False means searching only the direct children; otherwise the whole subtree is searched. The default is True.
The limit attribute is used to limit the number of returns when using findAll or similar methods that return lists.
e.g. findAll('p', limit=2): return the first two <p> tags found
(3)find_parents() find_parent()
find_all() and find() search only the descendants (children, grandchildren, and so on) of the current node. find_parents() and find_parent() search the ancestors of the current node instead; the filter arguments work the same way as for ordinary tag searches.
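A minimal runnable sketch of searching upward (Python 3 syntax; the nested markup is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup for find_parent()/find_parents().
html = '<div id="outer"><div id="inner"><b>bold</b></div></div>'
soup = BeautifulSoup(html, "html.parser")

b = soup.find("b")
print(b.find_parent("div")["id"])                 # inner  (nearest matching ancestor)
print([d["id"] for d in b.find_parents("div")])   # ['inner', 'outer']  (inside out)
```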
(4)find_next_siblings() find_next_sibling()
These two methods use the .next_siblings attribute to iterate over all the sibling tag nodes parsed after the current tag. find_next_siblings() returns all the later siblings that match the criteria; find_next_sibling() returns only the first one.
(5)find_previous_siblings() find_previous_sibling()
These two methods use the .previous_siblings attribute to iterate over the sibling tag nodes parsed before the current tag. find_previous_siblings() returns all the earlier siblings that match the criteria; find_previous_sibling() returns only the first one.
(6)find_all_next() find_next()
These two methods use the .next_elements attribute to iterate through the tags and strings after the current tag. find_all_next() returns all the matching nodes; find_next() returns the first one.
(7)find_all_previous() and find_previous()
These two methods use the .previous_elements attribute to iterate over the tags and strings before the current node. find_all_previous() returns all the matching nodes; find_previous() returns the first one.
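The sibling search pairs above can be exercised with a short sketch (Python 3 syntax; the flat markup is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical flat markup exercising the sibling search methods.
html = '<p>a</p><p>b</p><p>c</p>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")
last_p = soup.find_all("p")[-1]

print(first_p.find_next_sibling("p").get_text())                # b
print([p.get_text() for p in first_p.find_next_siblings("p")])  # ['b', 'c']
print(last_p.find_previous_sibling("p").get_text())             # b
```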
Note: the methods in (2) through (7) above take the same parameters as find_all() and work on the same principle, so they are not repeated here.
CSS Selector
When writing CSS, tag names are written as-is, class names are prefixed with a dot, and id names are prefixed with #. We can filter elements in a similar way here using the soup.select() method, which returns a list. (1) Finding by tag name
print soup.select('title')
#[<title>The Dormouse's story</title>]
print soup.select('a')
#[<a class="sister" href="/elsie"><!-- Elsie --></a>, <a class="sister" href="/lacie">Lacie</a>, <a class="sister" href="/tillie">Tillie</a>]
print soup.select('b')
#[<b>The Dormouse's story</b>]
(2) Search by class name
print soup.select('.sister')
#[<a class="sister" href="/elsie" ><!-- Elsie --></a>, <a class="sister" href="/lacie" >Lacie</a>, <a class="sister" href="/tillie" >Tillie</a>]
(3) Search by id name
print soup.select('#link1')
#[<a class="sister" href="/elsie" ><!-- Elsie --></a>]
(4) Combined search. Combined search works the same way as when writing a CSS file: tag names, class names and id names are combined on the same principles. For example, to find an element with id link1 inside a <p> tag, separate the two selectors with a space.
print soup.select('p #link1')
#[<a class="sister" href="/elsie" ><!-- Elsie --></a>]
Direct child lookup
print soup.select("head > title")
#[<title>The Dormouse's story</title>]
(5) Attribute search. You can also add attribute filters, written in square brackets. Note that the attribute and the tag belong to the same node, so no space may be added between them, otherwise nothing will match.
print soup.select('a[class="sister"]')
#[<a class="sister" href="/elsie" ><!-- Elsie --></a>, <a class="sister" href="/lacie" >Lacie</a>, <a class="sister" href="/tillie" >Tillie</a>]
print soup.select('a[href="/elsie"]')
#[<a class="sister" href="/elsie" ><!-- Elsie --></a>]
Similarly, attributes can still be combined with the above lookups, with spaces separating those not in the same node and no spaces in the same node
print soup.select('p a[href="/elsie"]')
#[<a class="sister" href="/elsie" ><!-- Elsie --></a>]
The select methods above all return results as a list, which can be traversed; call get_text() on each element to get its content.
soup = BeautifulSoup(html, 'lxml')
print type(soup.select('title'))
print soup.select('title')[0].get_text()
for title in soup.select('title'):
    print title.get_text()
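The select/get_text combination can also be run fully self-contained (Python 3 syntax; the trimmed markup is hypothetical, and html.parser is used so nothing beyond bs4 is required, though 'lxml' works if installed):

```python
from bs4 import BeautifulSoup

# Hypothetical trimmed version of the article's example document.
html = ("<html><head><title>The Dormouse's story</title></head><body>"
        '<p class="story"><a class="sister" href="/elsie" id="link1">Elsie</a></p>'
        "</body></html>")
soup = BeautifulSoup(html, "html.parser")

print(soup.select("title")[0].get_text())               # The Dormouse's story
print(soup.select("p a[href='/elsie']")[0].get_text())  # Elsie
```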
This is another search method, similar to the find_all method; isn't it convenient?
# Search by class
print soup.find_all("a", class_="sister")
# Find by attribute
print soup.find_all("a", attrs={"class": "sister"})
# Search by text
print soup.find_all(text="Elsie")
print soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# Limit the number of results
print soup.find_all("a", limit=2)
This article has walked through the Python crawler module Beautiful Soup in detail, from installation to its methods, with examples. For more on using Beautiful Soup, please see the related links below.