
How to parse web pages with BeautifulSoup in Python

I. Installation

  • Beautiful Soup is a third-party library, so it must be installed separately; the installation is very simple
  • Because bs4 relies on a document parser when parsing pages, you also need to install lxml as the parsing library
  • Python also ships with a built-in document parser (html.parser), but it is a little slower than lxml
pip install bs4
pip install lxml
pip install html5lib
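  • After installing, a quick check confirms that both bs4 and lxml can be imported; this is just a minimal sketch, and the printed version depends on your environment
import bs4
from bs4 import BeautifulSoup

# Print the installed bs4 version
print(bs4.__version__)

# Parse a tiny snippet with the lxml parser to confirm it is available
soup = BeautifulSoup('<p>hello</p>', 'lxml')
print(soup.p.text)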

II. Parsing a document

  • The second argument to BeautifulSoup() specifies the parser used to parse the document
  • The parser can also be lxml or html5lib; a short comparison sketch follows the example below
html = '''
<div class="modal-dialog">
<div class="modal-content">
<div class="modal-header">
<button type="button" class="close" data-dismiss="modal">&times;</button>
<h4 class="modal-title">Modal title</h4>
</div>
<div class="modal-body">
...
</div>
<div class="modal-footer">
<a href="#" rel="external nofollow"  rel="external nofollow"  class="btn btn-default" data-dismiss="modal">Close</a>
<a href="#" rel="external nofollow"  rel="external nofollow"  class="btn btn-primary">Save</a>
</div>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# prettify() formats the output of html/xml documents
print(soup.prettify())
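  • The parser choice mainly affects speed and how tolerant parsing is of broken markup; here is a minimal comparison sketch, assuming lxml and html5lib are installed as shown above
from bs4 import BeautifulSoup

broken = '<p>unclosed paragraph'

# Each parser repairs the incomplete markup slightly differently
for parser in ('html.parser', 'lxml', 'html5lib'):
    soup = BeautifulSoup(broken, parser)
    print(parser, '->', soup)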

III. Parsing external documents

  • External documents can also be parsed by opening the file and passing the file object to BeautifulSoup
from bs4 import BeautifulSoup
fp = open('html_doc.html', encoding='utf8')
soup = BeautifulSoup(fp, 'lxml')
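  • A slightly fuller sketch using a context manager so the file is closed automatically; the file name html_doc.html is only illustrative
from bs4 import BeautifulSoup

# The with statement closes the file automatically after parsing
with open('html_doc.html', encoding='utf8') as fp:
    soup = BeautifulSoup(fp, 'lxml')

# The soup object can still be used after the file is closed
print(soup.title)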

IV. Tag selector

  • Tags (Tag) are the basic elements that make up an HTML document.
  • The desired content can be extracted by tag name and tag attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p class="name nickname user"><b>i am autofelix</b></p>', 'html.parser')
# Get the html code of the entire p tag
print(soup.p)
# Get the b tag nested inside it
print(soup.p.b)
# Get the text content of the p tag; string, text and get_text() all expose the NavigableString content
print(soup.p.text)
# attrs returns a dictionary of all the tag's attributes and values
print(soup.p.attrs)
# View the type of the returned data
print(type(soup.p))
# Get an attribute value by name; a multi-valued attribute such as class is returned as a list
print(soup.p['class'])
# Assign a new value to the class attribute; the list is rendered back to a string when the tag is printed
soup.p['class'] = ['Web', 'Site']
print(soup.p)
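  • To make the difference between string, text and get_text() concrete, here is a small sketch on a tag with nested children; the markup is only an illustrative example
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>hello <b>world</b></p>', 'html.parser')

# .string is None when the tag has more than one child
print(soup.p.string)
# .text and .get_text() concatenate the text of all descendants
print(soup.p.text)
print(soup.p.get_text())
# .string works on the b tag because it has exactly one text child
print(soup.p.b.string)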

V. CSS selectors

  • Most CSS selectors are supported, such as common tag selectors, class selectors, id selectors, and hierarchical selectors
  • By passing a selector to the select() method, you can search the HTML document for the matching content
html = """
<html>
<head>
<title>Learn Programming from Scratch</title>
</head>
<body>
<p class="intro"><b>i am autofelix</b></p>
<p class="nickname">Flying Bunny</p>
<a href="" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >csdnhomepage</a>
<a href="/u/autofelix/publish" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >infoqhomepage</a>
<a href="https://blog./autofelix" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >51ctohomepage</a>
<p class="attention">Kneeling for attention one-touch triple</p>
<p class="introduce">
<a href="/autofelix" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >博客园homepage</a>
</p>
</body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# Find by class name
print(soup.select('.nickname'))
# Find by attribute selector
print(soup.select('a[href]'))
# Find by class
print(soup.select('.attention'))
# Descendant node lookup
print(soup.select('html head title'))
# Find sibling nodes (adjacent sibling selector)
print(soup.select('p + a'))
# Select the sibling node of a p tag by id
print(soup.select('p ~ #csdn'))
# nth-of-type(n) selector, used to match the nth sibling element of the same type
print(soup.select('p ~ a:nth-of-type(1)'))
# Find child nodes
print(soup.select('p > a'))
print(soup.select('.introduce > #cnblogs'))
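  • select() always returns a list, while select_one() returns only the first match (or None); a small sketch, reusing the soup object above, that pulls the text and href out of the matched tags
# select_one() returns the first matching tag instead of a list
first_link = soup.select_one('a[href]')
print(first_link.get_text())
print(first_link.get('href'))

# Iterate over every matched link and print its text
for link in soup.select('a'):
    print(link.get_text())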

VI. Node traversal

  • You can use contents and children to iterate over the child nodes
  • You can use parent and parents to walk up through the parent nodes
  • You can use next_sibling and previous_sibling to traverse sibling nodes; the sketch after the example below shows these
html = """
<html>
<head>
<title>Learn Programming from Scratch</title>
</head>
<body>
<p class="intro"><b>i am autofelix</b></p>
<p class="nickname">Flying Rabbit</p>
<a href="" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >csdnhomepage</a>
<a href="/u/autofelix/publish" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >infoqhomepage</a>
<a href="https://blog./autofelix" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >51ctohomepage</a>
<p class="attention">Kneeling for attention one-touch triple</p>
<p class="introduce">
<a href="/autofelix" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >博客园homepage</a>
</p>
</body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
body_tag = soup.body
print(body_tag)
# contents returns all child nodes as a list
print(body_tag.contents)
# children is a generator used to iterate over the child nodes
for child in body_tag.children:
    print(child)
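  • The bullet points above also mention walking up and sideways through the tree; here is a minimal sketch of parent, parents, next_sibling and previous_sibling, reusing the same soup object
# Start from the title tag and walk upwards
title_tag = soup.title
print(title_tag.parent.name)

# parents yields every ancestor up to the document itself
for parent in title_tag.parents:
    print(parent.name)

# Sibling traversal; whitespace between tags shows up as text nodes
nickname = soup.find('p', class_='nickname')
print(nickname.next_sibling)
print(nickname.previous_sibling)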

VII. find_all method

  • find_all() is one of the most commonly used methods when parsing HTML documents
  • The find_all() method searches all children of the current tag
  • and checks whether each of those nodes meets the filter criteria
  • Finally, the matching content is returned as a list
html = """
<html>
<head>
<title>Learn Programming from Scratch</title>
</head>
<body>
<p class="intro"><b>i am autofelix</b></p>
<p class="nickname">Flying Bunny</p>
<a href="" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >csdnhomepage</a>
<a href="/u/autofelix/publish" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >infoqhomepage</a>
<a href="https://blog./autofelix" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >51ctohomepage</a>
<p class="attention">Kneeling for attention one-touch triple</p>
<p class="introduce">
<a href="/autofelix" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >博客园homepage</a>
</p>
</body>
</html>
"""
import re
from bs4 import BeautifulSoup
# Create a soup parse object
soup = BeautifulSoup(html, 'html.parser')
# Find all a tags and return them
print(soup.find_all("a"))
# limit=2 returns only the first two a tags
print(soup.find_all("a", limit=2))
# Find by tag attribute and attribute value
print(soup.find_all("p", class_="nickname"))
# find_all() with no arguments returns every tag in the document
print(soup.find_all())
# Pass a list to match several tag names at once
print(soup.find_all(['b', 'a']))
# Use a regular expression to match the id attribute value
print(soup.find_all('a', id=re.compile(r'.\d')))
# id=True matches any tag that has an id attribute
print(soup.find_all(id=True))
# True matches every tag; the following code finds all tags and prints their names
for tag in soup.find_all(True):
    print(tag.name, end=" ")
# Print all tags whose name starts with b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# Full form
soup.find_all("a")
# Shorthand: calling the soup object directly is equivalent to find_all
soup("a")

VIII. The find method

html = """
<html>
<head>
  <title>Learn Programming from Scratch</title>
</head>
<body>
  <p class="intro"><b>i am autofelix</b></p>
  <p class="nickname">Flying Bunny</p>
  <a href="" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >csdnhomepage</a>
  <a href="/u/autofelix/publish" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >infoqhomepage</a>
  <a href="https://blog./autofelix" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >51ctohomepage</a>
  <p class="attention">Kneeling for attention one-touch triple</p>
  <p class="introduce">
    <a href="/autofelix" rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  rel="external nofollow"  >博客园homepage</a>
  </p>
</body>
</html>
"""
import re
from bs4 import BeautifulSoup
# Create a soup parse object
soup = BeautifulSoup(html, 'html.parser')
# Find the first a tag and return the result directly
print(soup.find('a'))
# Find by class name
print(soup.find(class_='intro'))
# Match the a tag with the specified href attribute
print(soup.find('a', href=''))
# Regular-expression matching on an attribute value
print(soup.find(class_=re.compile('tro')))
# The attrs parameter takes a dictionary of attribute values
print(soup.find(attrs={'class': 'introduce'}))
# When using find, if no query tag is found, None is returned, and the find_all method returns an empty list.
print(soup.find('aa'))
print(soup.find_all('bb'))
# Simplified writing
print(soup.head.title)
# The above code is equivalent to
print(soup.find("head").find("title"))

That is all for this article on parsing web pages with BeautifulSoup in Python. For more content on BeautifulSoup, please search my previous articles or continue to browse the related articles below, and I hope you will continue to support me in the future!