SoFunction
Updated on 2024-11-18

Searching document content with Python's Beautiful Soup module, explained in detail

Preface

This article uses the Beautiful Soup module's search functionality to search by tag name, tag attribute, document text, and regular expression.

Search Methods

Beautiful Soup's built-in search methods are as follows:

  • find()
  • find_all()
  • find_parent()
  • find_parents()
  • find_next_sibling()
  • find_next_siblings()
  • find_previous_sibling()
  • find_previous_siblings()
  • find_previous()
  • find_all_previous()
  • find_next()
  • find_all_next()

Use the find() method to search

The first thing you need to do is to create an HTML file for testing.

<html>
<body>
<div class="ecopyramid">
 <ul id="producers">
 <li class="producerlist">
  <div class="name">plants</div>
  <div class="number">100000</div>
 </li>
 <li class="producerlist">
  <div class="name">algae</div>
  <div class="number">100000</div>
 </li>
 </ul>
 <ul id="primaryconsumers">
 <li class="primaryconsumerlist">
  <div class="name">deer</div>
  <div class="number">1000</div>
 </li>
 <li class="primaryconsumerlist">
  <div class="name">rabbit</div>
  <div class="number">2000</div>
 </li>
 </ul>
 <ul >
 <li class="secondaryconsumerlist">
  <div class="name">fox</div>
  <div class="number">100</div>
 </li>
 <li class="secondaryconsumerlist">
  <div class="name">bear</div>
  <div class="number">100</div>
 </li>
 </ul>
 <ul >
 <li class="tertiaryconsumerlist">
  <div class="name">lion</div>
  <div class="number">80</div>
 </li>
 <li class="tertiaryconsumerlist">
  <div class="name">tiger</div>
  <div class="number">50</div>
 </li>
 </ul>
</div>
</body>
</html>

We can use the find() method to get the <ul> tag, which by default returns the first occurrence. From it we get the <li> tag, again the first occurrence, and then the <div> tag. Printing its content verifies that the first occurrence was obtained.

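from bs4 import BeautifulSoup

# Fill in the path of the test HTML file created above
with open('', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# find() returns the first <ul>; .li and .div each pick the first nested match
first_ul_entries = soup.find('ul')
print(first_ul_entries.li.div.string)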

The find() method is specified as follows:

find(name,attrs,recursive,text,**kwargs) 

As shown above, the find() method takes five arguments: name, attrs, recursive, text, and **kwargs. The name, attrs, and text arguments all act as filters in find(), which improves the accuracy of the matching results.
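
The recursive argument is not used elsewhere in this article, so here is a minimal sketch of how the name, attrs, and recursive arguments might behave. The markup is a shortened stand-in for the test file, and the html.parser backend is an assumption chosen so the snippet does not need lxml.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the test markup; html.parser is an assumption
html = """
<div class="ecopyramid">
 <ul>
  <li class="producerlist">
   <div class="name">plants</div>
   <div class="number">100000</div>
  </li>
 </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# name filter: the first <li> anywhere in the document
by_name = soup.find("li")

# attrs filter: the first tag whose class is "number"
by_attrs = soup.find(attrs={"class": "number"})

# recursive=False looks only at direct children of the tag being searched
pyramid = soup.find("div")
direct_li = pyramid.find("li", recursive=False)  # <li> is nested deeper
nested_li = pyramid.find("li")                   # default: recursive=True

print(by_name["class"])
print(by_attrs.string)
print(direct_li)
print(nested_li.div.string)
```

Because the <li> is a grandchild of the outer <div>, the non-recursive search finds nothing and returns None, while the default recursive search reaches it.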

Search tags

Besides searching for the <ul> tag as above, we can also search for <li> tags. Again, the result is the first match.

tag_li = soup.find('li')
# tag_li = soup.find(name="li")
print(type(tag_li))
print(tag_li.div.string)

Search text

If we only want to search based on text content, we can pass only the text parameter:

search_for_text = soup.find(text='plants')
print(type(search_for_text))
# <class 'bs4.element.NavigableString'>

The returned result is a NavigableString object.
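
As a small sketch of what can be done with the returned NavigableString, the example below walks from the text back to its enclosing tag. It uses string=, which is the newer spelling of the text argument (Beautiful Soup 4.4.0 and later), and the html.parser backend is an assumption.

```python
from bs4 import BeautifulSoup

html = '<li class="producerlist"><div class="name">plants</div><div class="number">100000</div></li>'
soup = BeautifulSoup(html, "html.parser")

# string= is the newer name for the text= argument
found = soup.find(string="plants")

print(type(found).__name__)   # the class of the returned object
print(found.parent.name)      # the tag that encloses the text
print(found.parent["class"])

# A plain string must match the whole text node, so a fragment finds nothing
partial = soup.find(string="plant")
print(partial)
```

A plain string only matches a complete text node. For partial matches, a regular expression is the usual choice.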

Search by regular expression

Consider the following piece of HTML text:

<div>The below HTML has the information that has email ids.</div>
 abc@example.com
<div>xyz@example.com</div>
 <span>foo@example.com</span>

You can see that the address starting with abc@ is not enclosed in any tag, so it cannot be located by tag name. In this case, we can use a regular expression to do the matching.

import re

email_id_example = """
 <div>The below HTML has the information that has email ids.</div>
 abc@example.com
 <div>xyz@example.com</div>
 <span>foo@example.com</span>
 """
email_soup = BeautifulSoup(email_id_example, 'lxml')
print(email_soup)
# Compile the pattern so that find() treats it as a regular expression
emailid_regexp = re.compile(r"\w+@\w+\.\w+")
first_email_id = email_soup.find(text=emailid_regexp)
print(first_email_id)

When matching with a regular expression, find() still returns only the first match, even if there is more than one.
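
To collect every match instead of just the first, the same compiled pattern can be handed to find_all(). The sketch below assumes hypothetical addresses on the reserved example.com domain and the html.parser backend.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical addresses on the reserved example.com domain
html = """
<div>The below HTML has the information that has email ids.</div>
abc@example.com
<div>xyz@example.com</div>
<span>foo@example.com</span>
"""
soup = BeautifulSoup(html, "html.parser")
email_pattern = re.compile(r"\w+@\w+\.\w+")

# find() stops at the first matching text node; find_all() collects them all
first = soup.find(string=email_pattern)
every = soup.find_all(string=email_pattern)

# Text nodes keep their surrounding whitespace, so strip before use
emails = [node.strip() for node in every]
print(first.strip())
print(emails)
```

The untagged address is found as well, because the pattern is tested against every text node in the document, not only against text inside a particular tag.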

Search by tag attribute value

You can search by the value of the tag's attribute:

search_for_attribute = soup.find(id='primaryconsumers')
print(search_for_attribute.li.div.string)

Searching by tag attribute values is available for most attributes, such as id, style, and title.

But two cases are different:

  • Custom attributes
  • The class attribute

For these, the attribute name cannot be passed directly as a keyword argument. Instead, we pass it to find() through the attrs parameter.

Search based on custom attributes

HTML5 allows custom attributes on tags, such as a data-custom attribute.

As the code below shows, if we try to search the same way we did with id, we get an error, because a Python keyword argument name cannot contain the - character.

customattr = """
 <p data-custom="custom">custom attribute example</p>
   """
customsoup = BeautifulSoup(customattr, 'lxml')
customsoup.find(data-custom="custom")
# SyntaxError: keyword can't be an expression
# (newer Python versions word this message differently)

Instead, pass a dictionary through the attrs parameter to search:

using_attrs = customsoup.find(attrs={'data-custom': 'custom'})
print(using_attrs)
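
The attrs dictionary can also test for the mere presence of an attribute by using True as the value. The following is a minimal sketch with made-up paragraphs, and the html.parser backend is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = """
<p data-custom="custom">custom attribute example</p>
<p data-custom="other">second example</p>
<p>no custom attribute</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact value match
exact = soup.find(attrs={"data-custom": "custom"})

# True matches any tag that has the attribute, whatever its value
any_value = soup.find_all(attrs={"data-custom": True})
values = [p["data-custom"] for p in any_value]

print(exact.string)
print(values)
```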

Search based on classes in CSS

Because class is a keyword in Python, the CSS class attribute cannot be passed directly as a keyword argument. It can be searched like a custom attribute, by passing a dictionary through the attrs parameter.

Besides attrs, you can also use the class_ parameter, whose trailing underscore distinguishes it from the class keyword and avoids the error.

css_class = soup.find(attrs={'class': 'producerlist'})
css_class2 = soup.find(class_="producerlist")
print(css_class)
print(css_class2)
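
One detail worth a short sketch: class is a multi-valued attribute, so a search for a single class name also matches tags that carry additional classes. The featured class below is made up for illustration, and the html.parser backend is an assumption.

```python
from bs4 import BeautifulSoup

# "featured" is a hypothetical extra class
html = """
<ul>
 <li class="producerlist featured">plants</li>
 <li class="producerlist">algae</li>
 <li class="primaryconsumerlist">deer</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all(class_="producerlist")
by_attrs = soup.find_all(attrs={"class": "producerlist"})

keyword_names = [li.string for li in by_keyword]
attrs_names = [li.string for li in by_attrs]
print(keyword_names)
print(attrs_names)
```

Both spellings select the same two <li> tags here, including the one that has a second class.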

Searching with customized functions

The find() method can also be passed a function, in which case it searches based on the conditions that the function defines.

The function should return either True or False.

def is_producers(tag):
    # True only for tags that have an id attribute equal to 'producers'
    return tag.has_attr('id') and tag.get('id') == 'producers'

tag_producers = soup.find(is_producers)
print(tag_producers.li.div.string)

The code defines an is_producers function that checks whether the tag has an id attribute and whether its value equals producers. It returns True if both conditions are met and False otherwise.
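
A function filter is most useful when the condition cannot be expressed as a plain attribute match. As a sketch, the hypothetical has_large_population filter below selects <li> entries by comparing the number they contain. The markup is a shortened stand-in for the test file and the html.parser backend is an assumption.

```python
from bs4 import BeautifulSoup

html = """
<ul>
 <li><div class="name">plants</div><div class="number">100000</div></li>
 <li><div class="name">deer</div><div class="number">1000</div></li>
 <li><div class="name">rabbit</div><div class="number">2000</div></li>
 <li><div class="name">fox</div><div class="number">100</div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

def has_large_population(tag):
    """Hypothetical filter: True for <li> tags whose number is above 1000."""
    if tag.name != "li":
        return False
    number = tag.find("div", class_="number")
    return number is not None and int(number.string) > 1000

first_match = soup.find(has_large_population)
all_matches = soup.find_all(has_large_population)
names = [li.find("div", class_="name").string for li in all_matches]

print(first_match.div.string)
print(names)
```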

Combined use of various search methods

Beautiful Soup provides a variety of search filters, and we can combine them in a single call to narrow the match and improve the accuracy of our searches.

combine_html = """
 <p class="identical">
  Example of p tag with class identical
 </p>
 <div class="identical">
  Example of div tag with class identical
 </div>
 """
combine_soup = BeautifulSoup(combine_html, 'lxml')
# Both tags share the class, so the tag name is needed to pick the <div>
identical_div = combine_soup.find("div", class_="identical")
print(identical_div)

Use the find_all() method to search

The find() method returns the first match from the search results, while the find_all() method returns all matching items.

The filters used with find() work the same way with find_all(). In fact, they can be used in any search method, for example find_parents() and find_next_siblings().

# Search all tags with class attribute equal to tertiaryconsumerlist.
all_tertiaryconsumers = soup.find_all(class_='tertiaryconsumerlist')
print(type(all_tertiaryconsumers))
for tertiaryconsumers in all_tertiaryconsumers:
    print(tertiaryconsumers.div.string)

The find_all() method is specified as follows:

find_all(name,attrs,recursive,text,limit,**kwargs)

Its parameters are similar to those of find(), with the addition of the limit parameter, which caps the number of results. The find() method behaves like find_all() with limit set to 1.

We can also pass a list of strings to search for several tags, tag attribute values, custom attribute values, or CSS classes at once.
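
A minimal sketch of the limit parameter, using a made-up list and assuming the html.parser backend:

```python
from bs4 import BeautifulSoup

html = """
<ul>
 <li>plants</li>
 <li>algae</li>
 <li>deer</li>
 <li>rabbit</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Stop after the first two matches
first_two = soup.find_all("li", limit=2)
print([li.string for li in first_two])

# With limit=1 the single result is the same tag that find() returns
only_one = soup.find_all("li", limit=1)
print(only_one[0] is soup.find("li"))

# When nothing matches, find() gives None and find_all() an empty list
print(soup.find("table"), soup.find_all("table"))
```

The last line shows the practical difference in return types: code that uses find() should check for None before accessing the result.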

# Search all div and li tags
div_li_tags = soup.find_all(["div", "li"])
print(div_li_tags)
print()
# Search all tags whose class attribute is producerlist or primaryconsumerlist
all_css_class = soup.find_all(class_=["producerlist", "primaryconsumerlist"])
print(all_css_class)
print()

Search for related tags

In general, we use the find() and find_all() methods to search for specific tags, but we can also search for other tags of interest that are related to those tags.

Search for parent tag

We can use the find_parent() or find_parents() method to search for the parent tags of a tag.

The find_parent() method returns the first match, while find_parents() returns all matches, in the same way that find() and find_all() differ.

# Searching for parent tags
primaryconsumers = soup.find_all(class_='primaryconsumerlist')
print(len(primaryconsumers))
# Take the first matching tag
primaryconsumer = primaryconsumers[0]
# Search all enclosing <ul> tags
parent_ul = primaryconsumer.find_parents('ul')
print(len(parent_ul))
# Each result contains the full content of that parent tag
print(parent_ul)
print()
# Search for the nearest parent tag. Either form works here
immediateprimary_consumer_parent = primaryconsumer.find_parent()
# immediateprimary_consumer_parent = primaryconsumer.find_parent('ul')
print(immediateprimary_consumer_parent)
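
The parent searches look at every enclosing tag, not only the immediate parent. The sketch below starts from a deeply nested <div> in a shortened stand-in for the test markup. The id value and the html.parser backend are assumptions.

```python
from bs4 import BeautifulSoup

# Shortened stand-in; the id value is an assumption
html = """
<div class="ecopyramid">
 <ul id="primaryconsumers">
  <li class="primaryconsumerlist"><div class="name">deer</div></li>
 </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
name_div = soup.find("div", class_="name")

# Nearest enclosing <ul>, even though the direct parent is the <li>
closest_ul = name_div.find_parent("ul")

# Every enclosing <div>; the starting tag itself is not included
outer_divs = name_div.find_parents("div")

# No enclosing <table> exists, so the result is None
missing = name_div.find_parent("table")

print(closest_ul["id"])
print([d["class"] for d in outer_divs])
print(missing)
```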

Search Sibling Tags

Beautiful Soup also provides the ability to search for sibling tags.

The find_next_siblings() method returns all the following tags at the same level, while find_next_sibling() returns only the next tag at the same level.

producers = soup.find(id='producers')
next_siblings = producers.find_next_siblings()
print(next_siblings)

Similarly, we can use the find_previous_siblings() and find_previous_sibling() methods to search for the preceding tags at the same level.
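
Here is a small sketch of the sibling searches in both directions, starting from a tag in the middle. The four id values are assumptions made for illustration, as is the html.parser backend.

```python
from bs4 import BeautifulSoup

# The id values are assumptions made for illustration
html = """
<div class="ecopyramid">
 <ul id="producers"></ul>
 <ul id="primaryconsumers"></ul>
 <ul id="secondaryconsumers"></ul>
 <ul id="tertiaryconsumers"></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
secondary = soup.find(id="secondaryconsumers")

next_id = secondary.find_next_sibling("ul")["id"]
previous_id = secondary.find_previous_sibling("ul")["id"]

# Results are ordered by distance from the starting tag, nearest first
all_previous = [ul["id"] for ul in secondary.find_previous_siblings("ul")]

print(next_id)
print(previous_id)
print(all_previous)
```

Note that the backward search lists the nearest sibling first, so the results come out in reverse document order.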

Search for the next tag

The find_next() method searches for the first matching tag that comes after the current one, while find_all_next() returns all the matching tags that follow.

# Search all <li> tags that come after the first <div>
first_div = soup.div
all_li_tags = first_div.find_all_next("li")
print(all_li_tags)
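
These methods follow document order, so they are not limited to siblings or to a single parent. The sketch below uses a shortened stand-in for the test markup (id values and the html.parser backend are assumptions) to show that the search descends into the starting tag and then continues past its parent.

```python
from bs4 import BeautifulSoup

# Shortened stand-in; the id values are assumptions
html = """
<ul id="producers">
 <li><div class="name">plants</div></li>
 <li><div class="name">algae</div></li>
</ul>
<ul id="primaryconsumers">
 <li><div class="name">deer</div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
first_li = soup.find("li")

# The next <div> in document order is the one nested inside first_li
next_div = first_li.find_next("div")

# Later <li> tags are found even when they sit in a different <ul>
following = [li.div.string for li in first_li.find_all_next("li")]

print(next_div.string)
print(following)
```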

Search previous tag

Similar to searching for the next tag, we can use the find_previous() and find_all_previous() methods to search for the preceding tags.
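
As a sketch of the backward direction, the example below starts from the last <li> and walks toward the beginning of the document. The markup is a shortened stand-in for the test file, with assumed id values and the html.parser backend.

```python
from bs4 import BeautifulSoup

# Shortened stand-in; the id values are assumptions
html = """
<ul id="producers">
 <li><div class="name">plants</div></li>
 <li><div class="name">algae</div></li>
</ul>
<ul id="primaryconsumers">
 <li><div class="name">deer</div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
last_li = soup.find_all("li")[-1]

# Closest <div> that appears before last_li in the document
previous_div = last_li.find_previous("div")

# All earlier <li> tags, nearest first
earlier = [li.div.string for li in last_li.find_all_previous("li")]

print(previous_div.string)
print(earlier)
```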

Summary

That is the entire content of this article. I hope it helps you in learning or using Python. If you have questions, feel free to leave a comment. Thank you for your support.