Python uses the lxml library to efficiently process XML and HTML

1. Introduction

In Python development, XML and HTML data are often processed, such as web page data crawling, configuration file parsing, etc. lxml is a powerful and efficient library based on libxml2 and libxslt libraries, providing a simple and easy-to-use API for handling XML and HTML documents. This tutorial will introduce in detail the installation, basic usage methods and some advanced techniques of lxml.

2. Install lxml

You can use pip to install lxml, open the command line tool and execute the following command:

pip install lxml

If you encounter permission issues, you may need to prepend sudo (for Linux or macOS).

3. Parsing XML and HTML documents

3.1 Parsing XML documents

Here is a simple example showing how to parse XML documents using lxml:

from lxml import etree
 
# XML Dataxml_data = '''
&lt;students&gt;
    &lt;student &gt;
        &lt;name&gt;Alice&lt;/name&gt;
        &lt;age&gt;20&lt;/age&gt;
    &lt;/student&gt;
    &lt;student &gt;
        &lt;name&gt;Bob&lt;/name&gt;
        &lt;age&gt;22&lt;/age&gt;
    &lt;/student&gt;
&lt;/students&gt;
'''
 
# parse XML dataroot = (xml_data)
 
# traverse student informationfor student in ('student'):
    student_id = ('id')
    name = ('name').text
    age = ('age').text
    print(f"Student ID: {student_id}, Name: {name}, Age: {age}")

In the above code, first use the () method to parse the XML string into an Element object, and then use the findall() and find() methods to find and access XML elements.

3.2 Parsing HTML documents

lxml can also be used to parse HTML documents, and the following is an example:

from lxml import etree
 
# HTML Datahtml_data = '''
&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;
    &lt;title&gt;Example Page&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;h1&gt;Welcome to the Example Page&lt;/h1&gt;
    &lt;p&gt;This is a paragraph.&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;
'''
 
# parse HTML dataparser = ()
root = (html_data, parser)
 
# Get title and paragraph texttitle = ('.//title').text
paragraph = ('.//p').text
 
print(f"Title: {title}")
print(f"Paragraph: {paragraph}")

Here we use () to create an HTML parser and pass it to the () method to parse HTML data.

4. Use XPath for data extraction

XPath is a powerful language for locate elements in XML and HTML documents, and lxml provides good support for XPath. Here is an example of extracting data using XPath:

from lxml import etree
 
xml_data = '''
&lt;books&gt;
    &lt;book category="fiction"&gt;
        &lt;title lang="en"&gt;Harry Potter&lt;/title&gt;
        &lt;author&gt;. Rowling&lt;/author&gt;
        &lt;year&gt;2005&lt;/year&gt;
    &lt;/book&gt;
    &lt;book category="non-fiction"&gt;
        &lt;title lang="en"&gt;Python Crash Course&lt;/title&gt;
        &lt;author&gt;Eric Matthes&lt;/author&gt;
        &lt;year&gt;2015&lt;/year&gt;
    &lt;/book&gt;
&lt;/books&gt;
'''
 
root = (xml_data)
 
# Use XPath to extract titles of all bookstitles = ('//book/title/text()')
for title in titles:
    print(title)
 
# Use XPath to extract authors of books with category "non-fiction"authors = ('//book[@category="non-fiction"]/author/text()')
for author in authors:
    print(author)

In this example, the xpath() method is used to execute an XPath expression, returning elements or text that meet the criteria.

5. Create and modify XML documents

5.1 Creating XML Documents

from lxml import etree
 
# Create root elementroot = ('employees')
 
# Create child elementsemployee1 = (root, 'employee', id='1')
name1 = (employee1, 'name')
 = 'John Doe'
age1 = (employee1, 'age')
 = '30'
 
employee2 = (root, 'employee', id='2')
name2 = (employee2, 'name')
 = 'Jane Smith'
age2 = (employee2, 'age')
 = '25'
 
# Write XML documents to a filetree = (root)
('', encoding='utf-8', xml_declaration=True, pretty_print=True)

The above code shows how to create an XML document using lxml and save it to a file.

5.2 Modify XML documents

from lxml import etree
 
# parse XML filestree = ('')
root = ()
 
# Modify employee informationfor employee in ('employee'):
    if ('id') == '1':
        age = ('age')
         = '31'
 
# Write the modified XML document back to the file('', encoding='utf-8', xml_declaration=True, pretty_print=True)

Here we first parse an XML file, then find the elements that need to be modified and update their contents, and finally write the modified document back to the file.

6. Advanced skills

6.1 Handling large XML files

For large XML files, you can use the iterparse() method for line-by-line parsing to avoid loading the entire file into memory:

from lxml import etree
 
context = ('large_file.xml', events=('end',), tag='record')
for event, element in context:
    # Process each record    print(('field').text)
    # Release elements to save memory    ()
    while () is not None:
        del ()[0]

6.2 Processing HTML form data

When processing HTML form data, you can use lxml to extract form fields and values:

from lxml import etree
 
html_data = '''
<form action="/submit" method="post">
    <input type="text" name="username" value="john_doe">
    <input type="password" name="password" value="secret">
    <input type="submit" value="Submit">
</form>
'''
 
parser = ()
root = (html_data, parser)
 
form_fields = {}
for input_element in ('.//input'):
    name = input_element.get('name')
    value = input_element.get('value')
    if name and value:
        form_fields[name] = value
 
print(form_fields)

7. Summary

lxml is a powerful, efficient and easy to use Python library for processing XML and HTML data. With this tutorial, you learned how to install lxml, parse and create XML/HTML documents, use XPath for data extraction, and some advanced tips.

This is the end of this article about Python using the lxml library to efficiently process XML and HTML. For more related Python lxml processing XML and HTML content, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!