lxml is a python parsing library that supports HTML and XML parsing, XPath parsing, and very efficient parsing.
XPath, full name XML Path Language, that is, XML path language, it is a language to find information in XML documents, it is initially used to search for XML documents, but it is also applicable to the search of HTML documents
XPath's selection function is very powerful, it provides very concise path selection expressions, in addition, it provides more than 100 built-in functions for string, numeric, and time matching, as well as node and sequence processing, etc. Almost any node we want to locate can be selected using XPath
XPath became a W3C standard on November 16, 1999, and is designed to be used by XSLT, XPointer, and other XML parsing software; more documentation can be found on its official website:https:///TR/xpath/
Problematic situation:
response = (url=url, headers=headers).text html = (response) name = ("/html/body/div[2]/ul/li[1]/a/p/text()")[0] print(name)
The data can be fetched normally, but the result is
å·²éªè¯ å®å ¨ ç¾ç
It's such a mess.
Solution:
name = ("/html/body/div[2]/ul/li[1]/a/p/text()")[0].encode('ISO-8859-1').decode('UTF-8')
UTF-8 on this side depends on the encoding of the page
Look at the coding of the page
F12
This is the whole content of this article.