1. How important is XML file parsing
Suppose you receive an XML file like this:
<bookstore> <book category="programming"> <title>PythonFrom Beginner to Mastery</title> <author>Zhang Wei</author> <year>2023</year> </book> <book category="novel"> <title>Three-body</title> <author>Liu Cixin</author> <year>2008</year> </book> </bookstore>
All the titles and author information need to be extracted, what would you do? Manual copy and paste? This obviously doesn't work when the file has a few hundred MB! Python's ElementTree module was created to solve this kind of problem.
2. Get started with ElementTree
1. Two ways to load XML
Method 1: Direct parse strings
import as ET xml_string = """ <bookstore> <book category="programming"> <title>PythonFrom Beginner to Mastery</title> <author>Zhang Wei</author> </book> </bookstore> """ root = (xml_string) # Load from string
Method 2: Read XML file
tree = ('') # Load from fileroot = ()
2. Traversing XML nodes
Get all book nodes:
for book in ('book'): print("Find a book:") print(f"category:{('category')}") print(f"Book title:{('title').text}") print(f"author:{('author').text}")
Output result:
Found a book:
Category: Programming
Book title: Python from Beginner to Mastery
Author: Zhang Wei
Found a book:
Category: Novel
Book title: Three Bodies
Author: Liu Cixin
3. Detailed explanation of ElementTree core operation
1. Three ways to find elements
# Find the first matching nodefirst_book = ('book') # Find all matching nodesall_books = ('book') # Use XPath to find (more powerful)titles = ('.//title') # Find all title nodes
2. Get node properties and text
# Get attributescategory = ('category') # Get text contenttitle = ('title').text # Handle nodes that may not existyear = ('year') if year is not None: print()
3. Handle namespaces
What should I do if I encounter XML with namespace?
<ns:book xmlns:ns=""> <ns:title>XMLAnalysis Guide</ns:title> </ns:book>
Analysis method:
ns = {'ns': ''} title = ('ns:title', ns).text
4. Practical combat: parse real scene XML
Suppose you want to process an RSS feed (actually XML format):
import requests url = "/rss" response = (url) root = () for item in ('.//item'): print(f"title:{('title').text}") print(f"Link:{('link').text}") print("----")
5. Performance optimization skills
When processing large XML files (such as hundreds of MB):
1. Use iterative analysis
for event, elem in ('big_file.xml'): if == 'book': print(('title').text) () # Clean the memory in time
2. Use lxml to speed up
from lxml import etree # Need to install: pip install lxml # 3-5 times faster than the standard libraryparser = (remove_blank_text=True) tree = ('', parser)
6. Frequently Asked Questions
Question 1: What to do if the encoding is incorrect?
with open('', 'r', encoding='utf-8') as f: tree = (f)
Question 2: Handling special characters
from import escape safe_text = escape('Text & Special Characters<>"')
Question 3: Beautify the output
from import minidom xml_str = (root) pretty_xml = (xml_str).toprettyxml()
7. Complete code example
import as ET def parse_xml(file_path): tree = (file_path) root = () results = [] for book in ('book'): data = { 'category': ('category'), 'title': ('title').text, 'author': ('author').text, 'year': ('year').text if ('year') is not None else None } (data) return results #User Examplebooks = parse_xml('') for book in books: print(f"{book['title']}({book['year']})")
8. Summary
ElementTree is the preferred tool for Python to handle XML because it:
- Simple and easy to use: a few lines of code can parse complex XML
- Full functions: Supports advanced features such as XPath and namespace
- Good performance: can handle GB-level files with lxml
Remember these key points:
- Use () for small files
- Large files use ()
- High performance requirements use lxml
This is the end of this article about Python using ElementTree to quickly parse XML files. For more related Python ElementTree to parse XML content, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!