SoFunction
Updated on 2024-11-19

A detailed tutorial on using the Python Beautiful Soup module

I. Introduction to the module

Beautiful Soup is a Python library that extracts data from HTML or XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document tree. Beautiful Soup can save you hours or even days of work.

II. Usage

1. Importing the module

# Define a sample document, then import the library and parse it
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="/elsie" class="sister" id="link1">Elsie</a>,
<a href="/lacie" class="sister" id="link2">Lacie</a> and
<a href="/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')  # Name the parser explicitly; html.parser ships with Python

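Four parsers are available: html.parser (built into Python), lxml (HTML), lxml-xml (XML, also selectable as xml), and html5lib. The lxml and html5lib parsers must be installed separately.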
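Since not every parser is installed on every machine, one option is to try them in order of preference and fall back to the built-in one. The following is only a minimal sketch: make_soup and its preferred tuple are hypothetical names introduced here, not part of Beautiful Soup. It relies on the FeatureNotFound exception that bs4 raises when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first installed parser in `preferred`."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, parser_name = make_soup("<p>Hello</p>")
print(parser_name, soup.p.string)
```

Different parsers can build different trees from the same invalid markup, so pinning a single parser is usually the safer choice when results must be reproducible.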

2. A few simple ways to browse structured data

# A Tag object corresponds, in plain terms, to a single tag in the HTML
soup.title                    # Get the entire title tag: <title>The Dormouse's story</title>
soup.title.name               # Get the name of the title tag: title
soup.title.parent.name        # Get the name of the title's parent tag: head
soup.p                        # Get the first p tag: <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']               # Get the class attribute of the first p: ['title'] (class is multi-valued, so a list)
soup.p.get('class')           # Equivalent to the above
soup.a                        # Get the first a tag
soup.find_all('a')            # Get all a tags
soup.find(id="link3")         # Get the tag whose id attribute is link3
soup.p['class'] = "newClass"  # Attributes and contents can be modified
del soup.p['class']           # An attribute can also be deleted
soup.find(class_='story').find('a').get('id')  # Get the id of the first a tag inside the tag whose class is story
soup.title.string             # Get the text of the title tag: The Dormouse's story
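To make the navigation calls above easy to try, here is a self-contained sketch. The document is a shortened version of the sample, and the id values link1 to link3 are assumptions made for the illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample document; the id values are assumed for this illustration.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Three sisters:
<a href="/elsie" class="sister" id="link1">Elsie</a>,
<a href="/lacie" class="sister" id="link2">Lacie</a> and
<a href="/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string            # text of the <title> tag
parent_name = soup.title.parent.name      # name of the tag that contains <title>
first_p_classes = soup.p["class"]         # class is multi-valued, so this is a list
link_ids = [a.get("id") for a in soup.find_all("a")]
third_link_text = soup.find(id="link3").string

soup.p["class"] = "newClass"              # overwrite an attribute
new_class = soup.p["class"]
del soup.p["class"]                       # remove the attribute again
has_class = soup.p.has_attr("class")

print(title_text, parent_name, first_p_classes, link_ids, third_link_text)
```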

III. Specific usage

1. Get tags with a specified attribute

Method 1: Match a single attribute
soup.find_all('div', id="even")                                 # Get all div tags with the attribute id=even
soup.find_all('div', attrs={'id': "even"})                      # Same effect as above
Method 2: Match multiple attributes
soup.find_all('div', id="even", class_="square")                # Get all div tags with id=even and class=square
soup.find_all('div', attrs={"id": "even", "class": "square"})   # Same effect as above
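The attribute filters can be checked against a small document. This is a sketch with made-up markup: the div contents A, B, and C are placeholders, and the repeated id is technically invalid HTML, used here only to mirror the calls above.

```python
from bs4 import BeautifulSoup

# Made-up markup; repeated ids are invalid HTML but parse fine for this demo.
html = """
<div id="even" class="square">A</div>
<div id="even" class="circle">B</div>
<div id="odd" class="square">C</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword form: class_ has a trailing underscore because class is a Python keyword.
by_id = [d.string for d in soup.find_all("div", id="even")]
by_both = [d.string for d in soup.find_all("div", id="even", class_="square")]

# Dictionary form: useful when attribute names are not valid Python identifiers.
by_attrs = [d.string for d in soup.find_all("div", attrs={"id": "even", "class": "square"})]

print(by_id, by_both, by_attrs)
```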

2. Get the attribute value of a tag

Method 1: Extraction by subscripting
for link in soup.find_all('a'):
    print(link['href'])        # Equivalent to print(link.get('href')) when the attribute exists
Method 2: Extraction via the attrs dictionary
for link in soup.find_all('a'):
    print(link.attrs['href'])
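The two access styles differ when an attribute is missing: subscripting raises KeyError, while get() returns None. A minimal sketch with made-up links, one of which has no href:

```python
from bs4 import BeautifulSoup

# Made-up markup: the second link deliberately lacks an href attribute.
html = '<a href="/elsie" id="link1">Elsie</a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

# get() returns None for a missing attribute instead of raising KeyError.
hrefs = [link.get("href") for link in soup.find_all("a")]

# href=True restricts the search to tags that actually carry the attribute,
# so subscripting is safe inside this loop.
only_with_href = [link["href"] for link in soup.find_all("a", href=True)]

# attrs exposes all attributes of a tag as a dictionary.
first_attrs = soup.a.attrs

print(hrefs, only_with_href, first_attrs)
```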

3. Get the content of a tag

divs = soup.find_all('div')        # Get all div tags
for div in divs:                   # Loop over each div
    a = div.find_all('a')[0]       # Find the first a tag in the div (raises IndexError if there is none)
    print(a.string)                # Output the contents of the a tag
If the result is not displayed correctly (for example, .string returns None when a tag has more than one child), convert the contents to a list, e.g. list(a.strings).
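The difference between .string, .strings, and get_text() shows up as soon as a tag contains nested markup. The following sketch uses made-up markup and uses find(), which returns None instead of raising when a div has no link.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second link mixes text with a nested <b> tag.
html = (
    '<div><a href="/x">plain</a></div>'
    '<div><a href="/y">mixed <b>bold</b> text</a></div>'
    '<div>no link here</div>'
)
soup = BeautifulSoup(html, "html.parser")

results = []
for div in soup.find_all("div"):
    a = div.find("a")          # None when the div contains no <a> tag
    if a is None:
        continue
    # .string is None when the tag has more than one child;
    # .strings and get_text() still reach every piece of text.
    results.append((a.string, list(a.strings), a.get_text()))

for row in results:
    print(row)
```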

4. stripped_strings

Use stripped_strings to remove \n line breaks and other surrounding whitespace.

divs = soup.find_all('div')
for div in divs:
    infos = list(div.stripped_strings)        # Strip leading/trailing whitespace and skip whitespace-only strings
    print(infos)
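Comparing .strings with .stripped_strings on the same tag makes the effect visible. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup with indentation and line breaks around the text.
html = """
<div>
  <span> Name </span>
  <span>
     Price
  </span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

raw = list(div.strings)             # includes whitespace-only strings between tags
clean = list(div.stripped_strings)  # whitespace trimmed, empty strings dropped

print(raw)
print(clean)
```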

IV. Output

1. Formatted output: prettify()

The prettify() method formats the Beautiful Soup document tree as a Unicode string, with each XML/HTML tag and each string on its own line.

markup = '<a href="/">I linked to <i></i></a>'
soup = BeautifulSoup(markup, 'html.parser')
soup.prettify()
# '<a href="/">\n I linked to\n <i>\n...'
print(soup.prettify())
# <a href="/">
#  I linked to
#  <i>
#  </i>
# </a>
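Because prettify() inserts line breaks and indentation, it is meant for reading, not for output that must keep its exact whitespace; str() gives the compact form. A short sketch with a made-up list:

```python
from bs4 import BeautifulSoup

# Made-up markup for the comparison.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")

compact = str(soup)                        # markup on a single line, as parsed
pretty_lines = soup.prettify().splitlines()  # one tag or string per line

print(compact)
for line in pretty_lines:
    print(line)
```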

2. get_text()

If you only want to get the text contained in the tag, then you can call the get_text() method, which gets all the text contained in the tag, including the content of the descendant tags, and returns the result as a Unicode string.

markup = '<a href="/">\nI linked to <i></i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')
soup.get_text()
# '\nI linked to \n'
soup.i.get_text()
# ''
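get_text() also accepts a separator and a strip flag, which helps when text from neighboring tags would otherwise run together. A minimal sketch; the text "a page" inside the i tag is a placeholder chosen for this example.

```python
from bs4 import BeautifulSoup

# "a page" is placeholder text for this illustration.
markup = '<a href="/">\nI linked to <i>a page</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

default = soup.get_text()                  # pieces of text concatenated as they are
piped = soup.get_text("|")                 # separator placed between the pieces
stripped = soup.get_text("|", strip=True)  # each piece trimmed, empty pieces dropped

print(repr(default))
print(repr(piped))
print(repr(stripped))
```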
