
Python beautifulsoup4 module in detail

I. BeautifulSoup4 basics

BeautifulSoup4 is a Python parsing library, used mainly for parsing HTML and XML; in crawler work you will mostly use it to parse HTML.

The library installation command is as follows:

pip install beautifulsoup4

When parsing data, BeautifulSoup relies on an underlying parser. The commonly used parsers and their strengths are listed below (a short sketch of choosing a parser follows the list):

  • Python standard library (html.parser): built into Python, with reasonable fault tolerance;
  • lxml parser: fast and fault tolerant;
  • html5lib: the most fault tolerant; parses pages the same way a browser does.
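A minimal sketch of switching between these parsers; html.parser needs no extra install, while the lxml and html5lib lines assume you have run pip install lxml / pip install html5lib beforehand:

from bs4 import BeautifulSoup

html = "<p>hello"
# The calls differ only in the parser name passed as the second argument
print(BeautifulSoup(html, "html.parser").p)  # Standard library parser
print(BeautifulSoup(html, "lxml").p)         # Requires lxml to be installed
print(BeautifulSoup(html, "html5lib").p)     # Requires html5lib to be installed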

The basic use of the beautifulsoup4 library is demonstrated with a custom piece of HTML. The test markup is as follows:

<html>
  <head>
    <title>Test bs4 module script</title>
  </head>
  <body>
    <h1>Eraser's crawler lesson</h1>
    <p>Demonstration with a custom piece of HTML code</p>
  </body>
</html>

Simple operations with BeautifulSoup include instantiating a BeautifulSoup object, printing page tags, and so on.

from bs4 import BeautifulSoup
text_str = """<html>
<head>
<title>Test bs4 module script</title>
</head>
<body>
<h1>Eraser's crawler lesson</h1>
<p>Demonstration with the 1st piece of custom HTML code</p>
<p>Demonstration with the 2nd piece of custom HTML code</p>
</body>
</html>
"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# The above turns a string into a BeautifulSoup object; parsing from an open file also works
# soup = BeautifulSoup(open(''))
print(soup)
# Print the page title tag
print(soup.title)
# Print the page head tag
print(soup.head)

# Test printing the paragraph tag p
print(soup.p)  # By default only the first one is returned

We can access page tags directly through the BeautifulSoup object, but there is a catch: accessing a tag this way only returns the first occurrence. In the code above, only one p tag is returned; to get more content, keep reading.

At this point in our study, we need to understand the four built-in objects in BeautifulSoup:

  • BeautifulSoup: the basic object, representing the whole HTML document; it can generally be treated as a Tag object;
  • Tag: a tag object; tags are the nodes of a web page, e.g. title, head, p;
  • NavigableString: the string inside a tag;
  • Comment: a comment object; rarely needed in crawler work.

The following code demonstrates where each of these objects appears; pay attention to the comments:

from bs4 import BeautifulSoup
text_str = """<html>
<head>
<title>Test bs4 module script</title>
</head>
<body>
<h1>Eraser's crawler lesson</h1>
<p>Demonstration with the 1st piece of custom HTML code</p>
<p>Demonstration with the 2nd piece of custom HTML code</p>
</body>
</html>
"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# The above turns a string into a BeautifulSoup object; parsing from an open file also works
# soup = BeautifulSoup(open(''))
print(soup)
print(type(soup))  # <class 'bs4.BeautifulSoup'>
# Print the page title tag
print(soup.title)
print(type(soup.title))         # <class 'bs4.element.Tag'>
print(type(soup.title.string))  # <class 'bs4.element.NavigableString'>
# Print the page head tag
print(soup.head)

A Tag object has two important attributes: name and attrs.

from bs4 import BeautifulSoup
text_str = """<html>
	<head>
		<title>Test bs4 module script</title>
	</head>
	<body>
		<h1>Eraser's crawler lesson</h1>
		<p>Demonstration with the 1st piece of custom HTML code</p>
		<p>Demonstration with the 2nd piece of custom HTML code</p>
		<a href="">CSDN link</a>
	</body>
</html>
"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
print(soup.name)             # [document]
print(soup.title.name)       # Get the tag name: title
print(soup.html.head.title)  # Lower-level tags can be reached through the tag hierarchy
print(soup.head.title)       # html is a special root tag and can be omitted
print(soup.p.a)              # Cannot reach the a tag this way (a is not inside p); returns None
print(soup.a.attrs)          # Get the attributes as a dictionary

The code above demonstrates the usage of the name and attrs attributes; attrs returns a dictionary, from which values can be read by key.

To get the value of a tag's attribute, you can also use the following methods in BeautifulSoup:

print(["href"])
print(("href"))

Getting the NavigableString object: after getting a page tag, the next step is to read the text inside it, which is done with the following code.

print(soup.a.string)

Besides that, the text attribute and the get_text() method can also be used to get a tag's content.

print(soup.a.string)
print(soup.a.text)
print(soup.a.get_text())

You can also get all the text inside a tag, using the strings and stripped_strings attributes.

print(list(soup.body.strings))           # Includes spaces and newlines
print(list(soup.body.stripped_strings))  # Spaces and newlines removed

Extension: traversing the document tree with tag/node selectors

Direct child nodes

The direct children of a Tag object can be obtained with the contents and children attributes.

from bs4 import BeautifulSoup
text_str = """<html>
	<head>
		<title>Test bs4 module script</title>
	</head>
	<body>
		<div id="content">
			<h1>Eraser's crawler lesson<span>best</span></h1>
            <p>Demonstration with the 1st piece of custom HTML code</p>
            <p>Demonstration with the 2nd piece of custom HTML code</p>
            <a href="">CSDN link</a>
		</div>
        <ul class="nav">
            <li>Home</li>
            <li>Blog</li>
            <li>Column lessons</li>
        </ul>

	</body>
</html>
"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# The contents attribute gets the direct children of a node and returns them as a list
print(soup.div.contents)  # Returns a list
# The children attribute also gets the direct children, but returns them as a generator
print(soup.div.children)  # Returns <list_iterator object at 0x00000111EE9B6340>

Note that both of the attributes above get only the direct children; descendant tags, such as the span inside the h1 tag, are not returned separately.

If you want all descendant tags, use the descendants attribute, which returns a generator in which every tag, and even the text inside tags, is yielded individually.

print(list(soup.div.descendants))

Getting other nodes (just know these exist and look them up when needed; a short sketch follows the list)

  • parent and parents: the direct parent node and all ancestor nodes;
  • next_sibling, next_siblings, previous_sibling, previous_siblings: the next sibling node, all following siblings, the previous sibling node, and all preceding siblings, respectively. Since a newline character also counts as a node, take a little care with newlines when using these attributes;
  • next_element, next_elements, previous_element, previous_elements: the next or previous node in document order. Note that these are not hierarchical; they walk across all nodes, e.g. in the code above the next element of the div node is h1, while the sibling node of the div is ul.
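A rough sketch of these attributes, reusing the soup object from the example above (whitespace text nodes show up in the results):

div = soup.div
print(div.parent.name)                # body: the direct parent
print([p.name for p in div.parents])  # ['body', 'html', '[document]']: all ancestors
# Newlines between tags are text nodes, so the nearest sibling is often '\n'
print(repr(div.next_sibling))
print(repr(div.next_sibling.next_sibling))  # The ul tag follows the newline node
# next_element steps into the tag's own content instead of across to a sibling
print(repr(div.next_element))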

Functions for searching the document tree

The first function to learn is find_all(). Its prototype is shown below:

find_all(name, attrs, recursive, text, limit=None, **kwargs)
  • name: the tag name, e.g. find_all('p') finds all p tags; this parameter accepts a tag name string, a regular expression, or a list;
  • attrs: attributes to match, passed as a dictionary, e.g. attrs={'class': 'nav'}; the result is a list of Tag objects;

Examples of the usage of the above two parameters are as follows:

import re  # For the regular-expression examples

print(soup.find_all('li'))                    # Get all li tags
print(soup.find_all(attrs={'class': 'nav'}))  # Pass the attrs parameter
print(soup.find_all(re.compile("p")))         # Pass a regex; matches any tag name containing p (p, span)
print(soup.find_all(['a', 'p']))              # Pass a list
  • recursive: when find_all() is called, BeautifulSoup searches all descendants of the current tag; if you only want to search the direct children, pass recursive=False. The test code is as follows:
print(soup.body.find_all(['a', 'p'], recursive=False))  # Only the direct children of body are searched
  • text: searches the text content of the document; its accepted values are the same as for the name parameter: a tag name string, a regular expression, or a list;
print(soup.find_all(text='Home'))                          # ['Home']
print(soup.find_all(text=re.compile("^Ho")))               # ['Home']
print(soup.find_all(text=["Home", re.compile("lesson")]))  # ["Eraser's crawler lesson", 'Home', 'Column lessons']
  • limit: can be used to limit the number of results returned (see the short sketch after this list);
  • kwargs: if a keyword argument is not one of the built-in parameter names for the search, it is treated as a tag attribute filter. To search by the class attribute, note that class is a reserved word in Python and must be written as class_. When searching with class_, matching a single CSS class name is enough; if you need to match several class names, they must be given in the same order as in the tag.
print(soup.find_all(class_='nav'))
print(soup.find_all(class_='nav li'))  # Multiple class names must follow the tag's order
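A brief sketch of the limit parameter, reusing the same soup object (not from the original article):

print(soup.find_all('li', limit=2))  # Only the first two li tags are returned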

Also note that some attributes of a web node cannot be used as kwargs in a search, e.g. the data-* attributes in HTML5; these have to be matched through the attrs parameter.
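For example, a minimal sketch (the data-id markup below is made up purely for illustration):

from bs4 import BeautifulSoup

data_soup = BeautifulSoup('<div data-id="10">text</div>', "html.parser")
# data_soup.find_all(data-id="10") would be a Python syntax error,
# so the attribute is matched through attrs instead
print(data_soup.find_all(attrs={"data-id": "10"}))  # [<div data-id="10">text</div>]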

Other methods whose usage is essentially the same as find_all() are listed below (a short sketch follows the list):

  • find(): prototype find(name, attrs, recursive, text, **kwargs); returns a single matching element;
  • find_parents(), find_parent(): prototype find_parent(self, name=None, attrs={}, **kwargs); return the parent nodes of the current node;
  • find_next_siblings(), find_next_sibling(): prototype find_next_sibling(self, name=None, attrs={}, text=None, **kwargs); return the following sibling node(s) of the current node;
  • find_previous_siblings(), find_previous_sibling(): as above, but return the preceding sibling(s) of the current node;
  • find_all_next(), find_next(), find_all_previous(), find_previous(): prototype find_all_next(self, name=None, attrs={}, text=None, limit=None, **kwargs); search the nodes that come after (or before) the current node in document order.
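A minimal sketch of a few of these methods, again reusing the soup object with the ul.nav list built earlier (expected output noted in the comments):

li = soup.find('li')                      # The first matching li tag
print(li)                                 # <li>Home</li>
print(li.find_parent('ul').get('class'))  # ['nav']: the enclosing ul
print(li.find_next_sibling('li'))         # <li>Blog</li>
print(li.find_all_next('li'))             # All li tags after this one in document order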

CSS selectors: the material in this subsection overlaps a bit with pyquery; the core is the select() method, and the returned data is a list of tags.

  • Find by tag name: soup.select("title");
  • Find by class name: soup.select(".nav");
  • Find by id: soup.select("#content");
  • Combined search: soup.select("div#content");
  • Find by attribute: soup.select("div[id='content']"), soup.select("a[href]").
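As a rough sketch of what select() returns (a plain list of Tag objects), using the soup object built above, whose div carries id="content":

tags = soup.select("div#content p")  # All p tags inside the div with id content
for tag in tags:
    print(tag.get_text())            # Each element of the list is a Tag object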

There are a few more tricks you can use when searching by attribute. Examples:

  • ^=: match nodes whose attribute value starts with the given characters:
print(soup.select('ul[class^="na"]'))
  • *=: match nodes whose attribute value contains the given characters:
print(soup.select('ul[class*="av"]'))

II. Crawler Cases

Once you have mastered the BeautifulSoup basics, writing a crawler case is very simple. This time the collection target is a site with a large number of artistic QR codes, which design gurus can use for reference.

The following applies the BeautifulSoup module's tag and attribute retrieval; the complete code is as follows:

from bs4 import BeautifulSoup
import requests
import logging

logging.basicConfig(level=logging.NOTSET)

def get_html(url, headers) -> None:
    res = None  # Guard so res is defined even if the request raises
    try:
        res = requests.get(url=url, headers=headers, timeout=3)
    except Exception as e:
        logging.error("Request exception: %s", e)

    if res is not None:
        html_str = res.text
        soup = BeautifulSoup(html_str, "html.parser")
        imgs = soup.find_all(attrs={'class': 'lazy'})
        print("Amount of data fetched:", len(imgs))
        datas = []
        for item in imgs:
            name = item.get('alt')
            src = item["src"]
            logging.info(f"{name},{src}")
            # Collect the name/src pairs
            datas.append((name, src))
        save(datas, headers)

def save(datas, headers) -> None:
    if datas is not None:
        for item in datas:
            res = None
            try:
                # Fetch the image
                res = requests.get(url=item[1], headers=headers, timeout=5)
            except Exception as e:
                logging.error(e)

            if res is not None:
                img_data = res.content
                with open("./imgs/{}.jpg".format(item[0]), "wb+") as f:
                    f.write(img_data)
    else:
        return None

if __name__ == '__main__':
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
    }
    url_format = "http:///#p{}"
    urls = [url_format.format(i) for i in range(1, 2)]
    get_html(urls[0], headers)

This code's test output is produced with the logging module. The test collects only 1 page of data; if you need to widen the collection scope, just modify the page-number rule inside the __main__ block. While writing the code, I found that the actual data request is of type POST with a JSON response, so this case serves only as a BeautifulSoup starter example.
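If you wanted to target the real POST/JSON interface instead, a rough sketch with requests could look like the following; the endpoint URL and payload fields here are hypothetical, since the article does not spell them out:

import requests

# Hypothetical endpoint and payload: inspect the site's network traffic for the real ones
api_url = "http://example.com/api/list"
payload = {"page": 1}
res = requests.post(url=api_url, data=payload, timeout=5)
if res.ok:
    print(res.json())  # The response body is JSON, per the note above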

This concludes this detailed look at the Python beautifulsoup4 module. For more beautifulsoup4 content, please search my previous articles, and I hope you will continue to support me!