SoFunction
Updated on 2024-11-20

Fixing garbled text encoding in an lxml-based Python crawler

lxml is a Python parsing library that supports HTML and XML parsing as well as XPath queries, and it parses very efficiently.

XPath (XML Path Language) is a language for locating information in XML documents. It was originally designed for searching XML documents, but it works equally well for HTML documents.

XPath's selection capabilities are very powerful. It provides concise path expressions, and later versions of the standard offer more than 100 built-in functions for string, numeric, and time matching, as well as node and sequence processing. Almost any node we want to locate can be selected with XPath. Note that lxml implements XPath 1.0, which has a smaller function set.
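As a rough illustration of these path expressions and built-in functions, here is a minimal sketch using lxml. The HTML snippet and variable names are made up for the example and do not come from any real page.

```python
from lxml import etree

# Hypothetical HTML snippet, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <div>header</div>
  <div>
    <ul>
      <li><a href="/a"><p>First item</p></a></li>
      <li><a href="/b"><p>Second item</p></a></li>
    </ul>
  </div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# Absolute path: text of the <p> inside the first <li> of the second <div>.
first = tree.xpath("/html/body/div[2]/ul/li[1]/a/p/text()")[0]

# Attribute selection: every href under a list item.
links = tree.xpath("//li/a/@href")

# Built-in string function: <p> nodes whose text contains "Second".
matches = tree.xpath("//p[contains(text(), 'Second')]/text()")

# Built-in node-set function: count() returns a float in XPath 1.0.
item_count = tree.xpath("count(//li)")

print(first, links, matches, item_count)
```

Absolute paths like the first query are brittle when the page layout changes, so a relative expression with a predicate is often the safer choice in a crawler.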

XPath became a W3C standard on November 16, 1999, and was designed for use by XSLT, XPointer, and other XML parsing software. More documentation can be found on its official website: https:///TR/xpath/

The problem:

import requests
from lxml import etree

response = requests.get(url=url, headers=headers).text
html = etree.HTML(response)
name = html.xpath("/html/body/div[2]/ul/li[1]/a/p/text()")[0]
print(name)

The request succeeds and the data is fetched, but the printed result is:

已验证 安全 盾牌

The text comes out garbled.

Solution:

name = html.xpath("/html/body/div[2]/ul/li[1]/a/p/text()")[0].encode('ISO-8859-1').decode('UTF-8')

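The 'UTF-8' here must match the page's actual encoding. The fix applies when the response text was decoded as ISO-8859-1, which is the default that requests falls back to for text responses whose Content-Type header declares no charset. Encoding back to ISO-8859-1 recovers the original bytes, which are then decoded with the page's real encoding.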
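The round trip can be reproduced without any network access. The sketch below uses a made-up sample string and assumes the page is really UTF-8 but was decoded as ISO-8859-1.

```python
# Hypothetical sample text standing in for the scraped value.
original = "数据编码"

# What the server actually sends: UTF-8 bytes.
raw_bytes = original.encode("utf-8")

# Wrong decoding step: each byte becomes one Latin-1 character (mojibake).
garbled = raw_bytes.decode("iso-8859-1")

# The fix: map the characters back to the same bytes, then decode correctly.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

This works because ISO-8859-1 maps every byte value to exactly one character, so the wrong decoding loses no information and can be reversed.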

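To check the page's encoding, press F12 to open the browser's developer tools and look for the charset declared in the page source (for example, in a <meta> tag).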
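The same check can be done in code instead of by eye. The following is only a sketch: `detect_meta_charset` is a hypothetical helper written for this example, not part of lxml or requests, and it reads only the charset declared in a <meta> tag.

```python
import re

def detect_meta_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in a <meta> tag, or `default` if none is found."""
    # Only scan the start of the document, where <meta> tags normally live.
    head = raw[:4096]
    match = re.search(rb"<meta[^>]+charset\s*=\s*[\"']?([\w-]+)", head, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else default

# Hypothetical page body, encoded as GBK to mimic a non-UTF-8 site.
page = '<html><head><meta charset="GBK"></head><body><p>你好</p></body></html>'.encode("gbk")

charset = detect_meta_charset(page)
text = page.decode(charset)
print(charset, text)
```

With requests, a comparable alternative to the encode/decode round trip is to set `response.encoding` to the detected charset before reading `response.text`, or to pass the raw `response.content` bytes to `etree.HTML` so the parser can honor the page's own charset declaration.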
