SoFunction
Updated on 2024-11-18

Python Web Crawler Essentials: How to Use Beautiful Soup

I. Introduction to Beautiful Soup

Beautiful Soup is a powerful parsing library that extracts data from web pages by navigating their structure and attributes.

It provides functions for navigating, searching, and modifying the parse tree, and it detects the document encoding automatically, so you rarely need to handle encodings yourself. Under the hood, Beautiful Soup relies on a parser; the most common choice is lxml.

II. Use of Beautiful Soup

Test Example:

<!DOCTYPE html>
<html>
<head>
    <meta content="text/html;charset=utf-8" http-equiv="content-type" />
    <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
    <meta content="always" name="referrer" />
    <link href="/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/" rel="stylesheet" type="text/css" />
    <title>Baidu's website,You'll see. </title>
</head>
<body link="#0000cc">
  <div>
    <div>
        <div class="head_wrapper">
          <div>
            <a class="mnav" href="" rel="external nofollow" name="tj_trnews">public information </a>
            <a class="mnav" href="https://" rel="external nofollow" name="tj_trhao123">hao123 </a>
            <a class="mnav" href="" rel="external nofollow" name="tj_trmap">atlase </a>
            <a class="mnav" href="" rel="external nofollow" name="tj_trvideo">video </a>
            <a class="mnav" href="" rel="external nofollow" name="tj_trtieba">electronic message board </a>
            <a class="bri" href="///more/" rel="external nofollow" name="tj_briicon" style="display: block;">More Products </a>
          </div>
        </div>
    </div>
  </div>
</body>
</html>

1. Node selector

As we saw earlier, a web page is composed of element nodes, and extracting a specific node's content gives us the data presented on the page. Node selectors simplify this extraction and let us pinpoint data without resorting to regular expressions.

from bs4 import BeautifulSoup

file = open("./test.html", 'rb')   # the saved test page (filename is illustrative)
html = file.read()
soup = BeautifulSoup(html, 'lxml')
print(soup.head)
print(soup.head.title)
print(soup.a)

[Running results]

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/" rel="stylesheet" type="text/css"/>
<title>Baidu's website,You'll see. </title>
</head>
<title>Baidu's website,You'll see. </title>
<a class="mnav" href="" rel="external nofollow" name="tj_trnews">public information </a>

Analysis:

The first print statement fetches the page's head node;

The second gets the title node inside the head node. This is nested selection, since the title node is nested within the head node;

The third gets an a node. The source contains many a nodes, but matching stops at the first one: when multiple nodes match, this kind of selection returns only the first and ignores all later ones.

2. Extracting information

Generally the data we need is located in node names, attribute values, and text values, and the following code shows how to get the data in these three places:

from bs4 import BeautifulSoup

file = open("./test.html", 'rb')
html = file.read()
soup = BeautifulSoup(html, 'lxml')
print(soup.body.name)
print(soup.a['class'])
print(soup.a['href'])
print(soup.a.string)

[Running results]

body
['mnav']

public information

Analysis:

The first gets the name of the body node;

The second gets the value of the a node's class attribute;

The third gets the value of the a node's href attribute (empty in the example, hence the blank line);

The fourth gets the text value of the a node.

3. Associated selection

(1) Child and descendant nodes

Child nodes can be accessed through the contents attribute (which returns a list) or the children attribute; descendant nodes are available through the descendants attribute. children and descendants return iterators, so the matched nodes are printed with a for loop.

from bs4 import BeautifulSoup

file = open("./test.html", 'rb')
html = file.read()
soup = BeautifulSoup(html, 'lxml')
# print(soup.body.contents)
for i, child in enumerate(soup.body.children):
    print(i, child)

[Running results]

0

1 <div>
<div>
<div class="head_wrapper">
<div>
<a class="mnav" href="" rel="external nofollow" name="tj_trnews">public information </a>
<a class="mnav" href="https://" rel="external nofollow" name="tj_trhao123">hao123 </a>
<a class="mnav" href="" rel="external nofollow" name="tj_trmap">atlase </a>
<a class="mnav" href="" rel="external nofollow" name="tj_trvideo">video </a>
<a class="mnav" href="" rel="external nofollow" name="tj_trtieba">electronic message board </a>
<a class="bri" href="///more/" rel="external nofollow" name="tj_briicon" style="display: block;">More Products </a>
</div>
</div>
</div>
</div>
2
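The descendants attribute mentioned above walks the whole subtree rather than just the direct children. A minimal sketch of the difference, using an inline fragment of the test page (an assumption; the article reads the page from a saved file) and the stdlib html.parser so lxml is not required:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline fragment standing in for the saved test page.
html = "<div><div class='head_wrapper'><a class='mnav'>public information</a></div></div>"
soup = BeautifulSoup(html, "html.parser")

# children yields direct children only; descendants walks the entire subtree
# (text nodes included, so we filter to tags here).
print([c.name for c in soup.div.children])                           # -> ['div']
print([d.name for d in soup.div.descendants if isinstance(d, Tag)])  # -> ['div', 'a']
```

Note that descendants also yields NavigableString text nodes, which is why the tag filter is applied.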

(2) Parent and ancestor nodes

The parent of a node is obtained through the parent attribute. For example, to get the parent of the title node in the example:

from bs4 import BeautifulSoup

file = open("./test.html", 'rb')
html = file.read()
soup = BeautifulSoup(html, 'lxml')
print(soup.title.parent)

[Running results]

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/" rel="stylesheet" type="text/css"/>
<title>Baidu's website,You'll see. </title>
</head>

Similarly, to get all the ancestor nodes of a node, call the parents attribute.
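A minimal sketch of parents, again on an inline fragment (an assumption; the article loads the page from a saved file):

```python
from bs4 import BeautifulSoup

# Inline fragment standing in for the saved test page.
html = ("<html><body><div class='head_wrapper'>"
        "<a class='mnav'>public information</a></div></body></html>")
soup = BeautifulSoup(html, "html.parser")

# parents walks upward through every ancestor, ending at the document object itself.
for i, parent in enumerate(soup.a.parents):
    print(i, parent.name)   # prints: 0 div, 1 body, 2 html, 3 [document]
```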

(3) Sibling nodes

Call next_sibling to get the node's next sibling element;

Call previous_sibling to get the node's previous sibling element;

Call next_siblings to get all siblings that follow the node;

Call previous_siblings to get all siblings that precede the node.
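The four attributes can be sketched on an inline fragment with two adjacent a tags (an assumption; the article works against the saved test page):

```python
from bs4 import BeautifulSoup

# Two <a> siblings written on one line, so no whitespace text nodes sit between them.
html = ('<div><a class="mnav" name="tj_trnews">public information</a>'
        '<a class="mnav" name="tj_trhao123">hao123</a></div>')
soup = BeautifulSoup(html, "html.parser")

first = soup.a
print(first.next_sibling.get_text())  # the hao123 tag
second = soup.find(name="a", attrs={"name": "tj_trhao123"})
print(second.previous_sibling.get_text())  # the public-information tag
for sib in first.next_siblings:       # every sibling after the node
    print(sib)
```

Note that in pretty-printed HTML, the whitespace between tags counts as text-node siblings too, so next_sibling often returns a newline string before the next tag.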

4. Method selector

find_all()

Finds all elements matching the given conditions. Its signature is:

find_all(name,attrs,recursive,text,**kwargs)

(1)name

Queries elements by node name, e.g. querying the a tag elements in the example:

from bs4 import BeautifulSoup

file = open("./test.html", 'rb')
html = file.read()
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name="a"))
for a in soup.find_all(name="a"):
    print(a)

[Running results]

[<a class="mnav" href="" rel="external nofollow" name="tj_trnews">public information </a>, <a class="mnav" href="https://" rel="external nofollow" name="tj_trhao123">hao123 </a>, <a class="mnav" href="" rel="external nofollow" name="tj_trmap">atlase </a>, <a class="mnav" href="" rel="external nofollow" name="tj_trvideo">video </a>, <a class="mnav" href="" rel="external nofollow" name="tj_trtieba">electronic message board </a>, <a class="bri" href="///more/" rel="external nofollow" name="tj_briicon" style="display: block;">More Products </a>]
<a class="mnav" href="" rel="external nofollow" name="tj_trnews">public information </a>
<a class="mnav" href="https://" rel="external nofollow" name="tj_trhao123">hao123 </a>
<a class="mnav" href="" rel="external nofollow" name="tj_trmap">atlase </a>
<a class="mnav" href="" rel="external nofollow" name="tj_trvideo">video </a>
<a class="mnav" href="" rel="external nofollow" name="tj_trtieba">electronic message board </a>
<a class="bri" href="///more/" rel="external nofollow" name="tj_briicon" style="display: block;">More Products </a>

(2)attrs

When querying we can also pass in tag attributes; the attrs parameter takes a dictionary.

from bs4 import BeautifulSoup

file = open("./test.html", 'rb')
html = file.read()
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name="a", attrs={"class": "bri"}))

[Running results]

[<a class="bri" href="///more/" rel="external nofollow" name="tj_briicon" style="display: block;">More Products </a>]

As you can see, with the addition of the class="bri" attribute, the query results in a single a tag element.

(3)text

The text parameter can be used to match the text of the node, passed in as either a string or a regular expression object.

import re
from bs4 import BeautifulSoup

file = open("./test.html", 'rb')
html = file.read()
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name="a", text=re.compile("public information")))

[Running results]

[<a class="mnav" href="" rel="external nofollow" name="tj_trnews">public information </a>]

Only the a tag whose text matches "public information" is returned.

find()

find() is used much like find_all(); the only difference is that find() returns the first matching element as a single Tag, while find_all() returns all matching elements as a list.
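The contrast can be sketched on an inline fragment (an assumption; the article queries the saved test page):

```python
from bs4 import BeautifulSoup

# Two matching <a> tags, so find() and find_all() behave visibly differently.
html = '<div><a class="mnav">public information</a><a class="mnav">hao123</a></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find(name="a")        # a single Tag: the first match
every = soup.find_all(name="a")    # a list of every match
print(first.get_text())            # -> public information
print(len(every))                  # -> 2
print(first is every[0])           # the same tree node -> True
```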

5. CSS selectors

To use a CSS selector, call the select() method, passing in the appropriate CSS selector;

For example, use a CSS selector to get the a tags in the example:

from bs4 import BeautifulSoup

file = open("./test.html", 'rb')
html = file.read()
soup = BeautifulSoup(html, 'lxml')
print(soup.select('a'))
for a in soup.select('a'):
    print(a)

[Running results]

[<a class="mnav" href="" rel="external nofollow" name="tj_trnews">public information </a>, <a class="mnav" href="https://" rel="external nofollow" name="tj_trhao123">hao123 </a>, <a class="mnav" href="" rel="external nofollow" name="tj_trmap">atlase </a>, <a class="mnav" href="" rel="external nofollow" name="tj_trvideo">video </a>, <a class="mnav" href="" rel="external nofollow" name="tj_trtieba">electronic message board </a>, <a class="bri" href="///more/" rel="external nofollow" name="tj_briicon" style="display: block;">More Products </a>]
<a class="mnav" href="" rel="external nofollow" name="tj_trnews">public information </a>
<a class="mnav" href="https://" rel="external nofollow" name="tj_trhao123">hao123 </a>
<a class="mnav" href="" rel="external nofollow" name="tj_trmap">atlase </a>
<a class="mnav" href="" rel="external nofollow" name="tj_trvideo">video </a>
<a class="mnav" href="" rel="external nofollow" name="tj_trtieba">electronic message board </a>
<a class="bri" href="///more/" rel="external nofollow" name="tj_briicon" style="display: block;">More Products </a>

Getting attributes

Get the href attribute of each a tag above:

from bs4 import BeautifulSoup

file = open("./test.html", 'rb')
html = file.read()
soup = BeautifulSoup(html, 'lxml')
for a in soup.select('a'):
    print(a['href'])

[Running results]


https://



///more/
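When an attribute may be missing, Tag.get() is a safer alternative to indexing: indexing raises KeyError for an absent attribute, while get() returns None. A small sketch (the href value here is an illustrative placeholder, not from the test page):

```python
from bs4 import BeautifulSoup

# Illustrative tag; the URL is a placeholder.
html = '<a class="mnav" href="https://example.com" name="tj_trnews">public information</a>'
soup = BeautifulSoup(html, "html.parser")

a = soup.a
print(a["href"])       # indexing: raises KeyError if the attribute is absent
print(a.get("style"))  # get(): returns None instead of raising -> None
print(a.attrs)         # the full attribute dictionary; class is parsed as a list
```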

Get Text

To get the text content of the a tags above, use the get_text() method or the string attribute:

from bs4 import BeautifulSoup

file = open("./test.html", 'rb')
html = file.read()
soup = BeautifulSoup(html, 'lxml')
for a in soup.select('a'):
    print(a.get_text())
    print(a.string)

[Running results]

public information
public information
hao123
hao123
atlase
atlase
video
video
electronic message board
electronic message board
More Products
More Products

This concludes the article on using Beautiful Soup for Python web crawling. For more on Beautiful Soup in Python, please search my earlier articles or browse the related articles below, and I hope you will continue to support me!