I. Supplementary BeautifulSoup4 Basics
BeautifulSoup4 is a Python parsing library, mainly used for parsing HTML and XML; in crawler work, parsing HTML is the more common case.
The library installation command is as follows:
pip install beautifulsoup4
When parsing data, BeautifulSoup relies on a third-party parser. The commonly used parsers and their advantages are listed below (a quick comparison sketch follows the list):
- python standard library: Python's built-in standard library, with decent fault tolerance;
- lxml parser: fast and fault-tolerant;
- html5lib: the most fault-tolerant; parses in the same way a browser does.
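As a quick illustration (my own sketch, not part of the original article), the parser is selected by the second argument to BeautifulSoup; lxml and html5lib must be installed separately:

from bs4 import BeautifulSoup

broken_html = "<p>an unclosed paragraph"
# Built-in parser, no extra installation required
print(BeautifulSoup(broken_html, "html.parser").p)
# These two require `pip install lxml` / `pip install html5lib` first:
# print(BeautifulSoup(broken_html, "lxml").p)
# print(BeautifulSoup(broken_html, "html5lib").p)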
The basic use of the beautifulsoup4 library is demonstrated below with a custom piece of HTML. The test code is as follows:
<html>
<head>
<title>Test bs4 module script</title>
</head>
<body>
<h1>Eraser's crawler class</h1>
<p>Demonstration with a piece of custom HTML code</p>
</body>
</html>
We use BeautifulSoup to perform simple operations on it, such as instantiating a BS object and printing page tags.
from bs4 import BeautifulSoup

text_str = """<html>
<head>
<title>Test bs4 module script</title>
</head>
<body>
<h1>Eraser's crawler class</h1>
<p>Demonstration with the 1st piece of custom HTML code</p>
<p>Demonstration with the 2nd piece of custom HTML code</p>
</body>
</html>
"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# The above formats a string into a BeautifulSoup object; you can also build one from a file
# soup = BeautifulSoup(open(''))
print(soup)
# Print the page title tag
print(soup.title)
# Print the page head tag
print(soup.head)
# Print a paragraph tag p
print(soup.p)  # By default only the first one is fetched
We can access web page tags directly through the BeautifulSoup object, but there is one catch: accessing a tag this way only returns the first occurrence of that tag. In the code above, only one p tag is fetched; to get more content, keep reading.
At this point in our study, we need to understand the four built-in objects in BeautifulSoup:
- BeautifulSoup: the basic object, the whole HTML document, which can generally be treated as a Tag object;
- Tag: the tag object; tags are the nodes in a web page, e.g. title, head, p;
- NavigableString: the string inside a tag;
- Comment: the comment object; not used in many crawler scenarios.
The following code demonstrates the scenarios in which these objects appear; pay attention to the comments in the code:
from bs4 import BeautifulSoup

text_str = """<html>
<head>
<title>Test bs4 module script</title>
</head>
<body>
<h1>Eraser's crawler class</h1>
<p>Demonstration with the 1st piece of custom HTML code</p>
<p>Demonstration with the 2nd piece of custom HTML code</p>
</body>
</html>
"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# The above formats a string into a BeautifulSoup object; you can also build one from a file
# soup = BeautifulSoup(open(''))
print(soup)
print(type(soup))  # <class 'bs4.BeautifulSoup'>
# Print the page title tag
print(soup.title)
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(type(soup.title.string))  # <class 'bs4.element.NavigableString'>
# Print the page head tag
print(soup.head)
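The demo above covers BeautifulSoup, Tag, and NavigableString but not Comment; here is a minimal sketch of my own showing where a Comment object appears (the HTML snippet is made up):

from bs4 import BeautifulSoup

comment_soup = BeautifulSoup("<b><!-- a comment inside b --></b>", "html.parser")
print(comment_soup.b.string)        # a comment inside b
print(type(comment_soup.b.string))  # <class 'bs4.element.Comment'>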
For a Tag object, there are two important attributes: name and attrs.
from bs4 import BeautifulSoup

text_str = """<html>
<head>
<title>Test bs4 module script</title>
</head>
<body>
<h1>Eraser's crawler class</h1>
<p>Demonstration with the 1st piece of custom HTML code</p>
<p>Demonstration with the 2nd piece of custom HTML code</p>
<a href="" rel="external nofollow">CSDN link</a>
</body>
</html>
"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
print(soup.name)  # [document]
print(soup.title.name)  # Get the tag name: title
print(soup.html.body.a)  # Lower-level tags can be reached through the tag hierarchy
print(soup.body.a)  # html is a special root tag and can be omitted
print(soup.p.a)  # Cannot get the a tag (it is not inside a p tag)
print(soup.a.attrs)  # Get the attributes
The above code demonstrates the usage of the name and attrs attributes; attrs returns a dictionary from which you can get the corresponding value by key.
To get the value of a tag's attribute, you can also use the following methods in BeautifulSoup:
print(["href"]) print(("href"))
Getting a NavigableString object
After getting the page tags, it's time to get the text inside the tags, which is done with the following code.
print(soup.title.string)
In addition, you can use the text attribute and the get_text() method to get a tag's content.
print(soup.title.text)
print(soup.title.string)
print(soup.title.get_text())
It is also possible to get all the text inside a tag, using the strings and stripped_strings attributes.
print(list(soup.body.strings))  # Includes spaces and newlines
print(list(soup.body.stripped_strings))  # Spaces and newlines stripped
Extension: tag/node selectors for traversing the document tree
Direct child nodes
The direct child elements of a Tag object can be obtained with the contents and children attributes.
from bs4 import BeautifulSoup

text_str = """<html>
<head>
<title>Test bs4 module script</title>
</head>
<body>
<div id="content">
<h1>Eraser's crawler class<span>Best</span></h1>
<p>Demonstration with the 1st piece of custom HTML code</p>
<p>Demonstration with the 2nd piece of custom HTML code</p>
<a href="" rel="external nofollow">CSDN link</a>
</div>
<ul class="nav">
<li>Home</li>
<li>Blog</li>
<li>Column courses</li>
</ul>
</body>
</html>
"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# The contents attribute gets the direct children of a node, returned as a list
print(soup.div.contents)  # Returns a list
# The children attribute also gets the direct children, returned as a generator
print(soup.div.children)  # Returns <list_iterator object at 0x00000111EE9B6340>
Note that both attributes fetch only direct child nodes; for example, the span tag nested inside the h1 tag will not be fetched separately.
If you want all of the descendant tags, use the descendants attribute; it returns a generator in which every tag, and the text inside every tag, is fetched individually.
print(list(soup.div.descendants))
Getting other nodes (just know these exist and look them up when needed)
- parent and parents: the direct parent node and all ancestor nodes;
- next_sibling, next_siblings, previous_sibling, previous_siblings: the next sibling node, all following sibling nodes, the previous sibling node, and all preceding sibling nodes, respectively. Since a newline character is also a node, be a little careful about newlines when using these attributes (see the sketch after this list);
- next_element, next_elements, previous_element, previous_elements: the next node(s) or previous node(s), respectively. Note that these are not hierarchical; they step through all nodes in document order. For example, in the code above, the next element of the div node is h1, while the sibling node of div is ul.
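As a quick, self-contained illustration of the newline caveat and of next_element (a sketch of my own; the two-paragraph HTML here is made up):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p>\n<p>two</p></div>", "html.parser")
p = soup.p
print(p.parent.name)                # div
print(repr(p.next_sibling))         # '\n' -- the newline counts as a node
print(p.next_sibling.next_sibling)  # <p>two</p>
print(p.next_element)               # 'one' -- descends into the tag, ignoring hierarchy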
Functions for searching the document tree
The first function to learn is find_all(); its prototype is shown below:

find_all(name, attrs, recursive, text, limit=None, **kwargs)
- name: the tag name, e.g. find_all('p') finds all p tags; it accepts a tag-name string, a regular expression, or a list;
- attrs: the attributes to match, passed in as a dictionary, e.g. attrs={'class': 'nav'}; the result is a list of Tag objects;
Examples of the usage of the above two parameters are as follows:
import re

print(soup.find_all('li'))  # Get all li tags
print(soup.find_all(attrs={'class': 'nav'}))  # Pass the attrs parameter
print(soup.find_all(re.compile("p")))  # Pass a regular expression; not thoroughly tested in practice
print(soup.find_all(['a', 'p']))  # Pass a list
- recursive: when find_all() is called, BeautifulSoup searches all descendants of the current tag; if you only want to search the tag's direct children, pass recursive=False.

The test code is as follows:
print(soup.div.find_all(['a', 'p'], recursive=False))  # Search only the direct children of div
- text: retrieves the text strings in the document; like the name parameter, text accepts a tag-name string, a regular expression, or a list;
print(soup.find_all(text='Home'))  # ['Home']
print(soup.find_all(text=re.compile("^Ho")))  # ['Home']
print(soup.find_all(text=["Home", re.compile('course')]))  # ['Home', 'Column courses']
- limit: can be used to limit the number of results returned (a one-line example follows the class_ code below);
- kwargs: if a keyword argument's name is not one of the built-in search parameter names, the search treats it as a tag attribute filter. To search by the class attribute, note that class is a reserved word in Python, so it must be written as class_. When searching with class_, a single CSS class name is enough; if you use multiple CSS class names, they must appear in the same order as in the tag.
print(soup.find_all(class_='nav'))
print(soup.find_all(class_='nav li'))
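The limit parameter has no example above; a one-line sketch against the same test document:

print(soup.find_all('li', limit=2))  # returns at most two li tags: Home and Blog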
Note also that some attributes of web page nodes cannot be used as kwargs in a search, e.g. the data-* attributes in HTML5; these need to be matched through the attrs parameter.
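For instance (a sketch of my own; the data-demo attribute name is made up), a data-* attribute cannot be written as a keyword argument because of the hyphen, but attrs handles it:

from bs4 import BeautifulSoup

data_soup = BeautifulSoup('<div data-demo="value">text</div>', "html.parser")
# data_soup.find_all(data-demo="value")  # SyntaxError: keywords can't contain '-'
print(data_soup.find_all(attrs={"data-demo": "value"}))  # [<div data-demo="value">text</div>]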
Other methods whose usage is essentially the same as find_all() are listed below (a short sketch follows the list):
- find(): prototype find(name, attrs, recursive, text, **kwargs); returns a single matching element;
- find_parents(), find_parent(): prototype find_parent(self, name=None, attrs={}, **kwargs); return the ancestor nodes / direct parent of the current node;
- find_next_siblings(), find_next_sibling(): prototype find_next_sibling(self, name=None, attrs={}, text=None, **kwargs); return the following siblings / next sibling of the current node;
- find_previous_siblings(), find_previous_sibling(): as above, but return the preceding siblings / previous sibling of the current node;
- find_all_next(), find_next(), find_all_previous(), find_previous(): prototype find_all_next(self, name=None, attrs={}, text=None, limit=None, **kwargs); search the nodes that come after (or, for the previous variants, before) the current node in the document.
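A brief sketch of the difference between these and find_all(), run against the test document defined earlier (my own example):

li = soup.find('li')                 # find() returns the first match, not a list
print(li)                            # <li>Home</li>
print(li.find_next_sibling())        # <li>Blog</li>
print(li.find_parent('ul'))          # the enclosing ul tag
print(soup.find('li', text='Blog'))  # name and text filters combine: <li>Blog</li>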
CSS selectors
The knowledge in this subsection overlaps somewhat with pyquery. The core is the select() method, which returns a list of matching nodes (a combined sketch follows the list of examples below).
- Find by tag name: soup.select("title");
- Find by class name: soup.select(".nav");
- Find by id: soup.select("#content");
- Find by combination: soup.select("div#content");
- Find by attribute: soup.select("div[id='content']"), soup.select("a[href]");
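As a quick extension (my own sketch, run against the test document above), selectors can be combined, and select_one() returns only the first match:

print(soup.select("ul.nav li"))        # descendant combinator: every li inside ul.nav
print(soup.select_one("#content h1"))  # first match only, or None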
There are a few more tricks you can use when matching by attribute. Examples:
- ^=: matches nodes whose attribute value starts with the given characters:

print(soup.select('ul[class^="na"]'))
- *=: matches nodes whose attribute value contains the given characters:

print(soup.select('ul[class*="li"]'))
II. Crawler Case
After mastering the basics of BeautifulSoup, writing a crawler case is very simple. The collection target this time is a site that hosts a large number of artistic QR codes, which designers can use for reference.
The following applies the tag and attribute retrieval features of the BeautifulSoup module; the complete code is as follows:
from bs4 import BeautifulSoup
import requests
import logging

# The log level was elided in the original; NOTSET logs everything
logging.basicConfig(level=logging.NOTSET)

def get_html(url, headers) -> None:
    res = None
    try:
        res = requests.get(url=url, headers=headers, timeout=3)
    except Exception as e:
        logging.error("Collection exception %s", e)

    if res is not None:
        html_str = res.text
        soup = BeautifulSoup(html_str, "html.parser")
        imgs = soup.find_all(attrs={'class': 'lazy'})
        print("Amount of data fetched:", len(imgs))
        datas = []
        for item in imgs:
            name = item.get('alt')
            src = item["src"]
            logging.info(f"{name},{src}")
            # Collect the assembled data
            datas.append((name, src))
        save(datas, headers)

def save(datas, headers) -> None:
    if datas is not None:
        for item in datas:
            res = None
            try:
                # Fetch the image
                res = requests.get(url=item[1], headers=headers, timeout=5)
            except Exception as e:
                logging.error(e)
            if res is not None:
                img_data = res.content
                with open("./imgs/{}.jpg".format(item[0]), "wb+") as f:
                    f.write(img_data)
    else:
        return None

if __name__ == '__main__':
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
    }
    url_format = "http:///#p{}"  # the site URL was elided in the original
    urls = [url_format.format(i) for i in range(1, 2)]
    get_html(urls[0], headers)
The code logs its test output with the logging module; the effect is shown in the figure below. The test collects only 1 page of data; to expand the collection scope, just modify the page-number rule inside the main block. While writing the code, I found that the site's data request is actually a POST returning JSON, so this case serves only as an introductory BeautifulSoup example.
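Since the real data request turned out to be POST with a JSON response, the fetch step would look roughly like the sketch below. Everything specific here (the URL, the form field, the JSON keys) is an assumption of mine, as the original article does not give them:

import requests

# Hypothetical endpoint and payload -- replace with the site's real API
res = requests.post(url="http://example.com/api/list", headers=headers,
                    data={"page": 1}, timeout=3)
items = res.json()  # the body is JSON, so no HTML parsing is needed
# for item in items.get("list", []):       # assumed key
#     print(item.get("name"), item.get("src"))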
This concludes this detailed article on the Python beautifulsoup4 module. For more on beautifulsoup4 in Python, please search my earlier articles or continue browsing the related articles below, and I hope you will support me in the future!