Python information extraction: solving garbled text (mojibake)
I'll just share what I ran into, so skip this if it doesn't apply to you, and read on if it does!
Information capture with Python: use urllib2 to fetch the page you want, parse the content with lxml or BeautifulSoup (with re for the odd bits), then insert the extracted fields into MySQL. That looks very simple and very easy, but here is where the disgusting part comes in. First, many domestic sites pay no attention to the encoding they declare or the encoding their source is actually saved in. In a word: even if a tool, or the page source itself, tells you a site is UTF-8, or GBK, and so on, do not believe it. What you are asked to believe, namely the <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> tag, is a recipe for disaster.
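As a quick illustration (a minimal sketch, not the exact code from this workflow; the URL is a placeholder), the charset a page declares in its meta tag and the encoding chardet infers from the raw bytes can simply disagree:

    import re
    import urllib2
    import chardet

    html = urllib2.urlopen("http://example.com/page").read()  # placeholder URL

    # What the page claims via its <meta ... charset=...> attribute
    m = re.search(r'charset=["\']?([-\w]+)', html, re.I)
    print 'declared:', m.group(1) if m else '(none)'

    # What the bytes actually look like, according to chardet
    print 'detected:', chardet.detect(html)['encoding']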
Part of the process is shown below (the specific libraries themselves are not the point here):
    import urllib2
    import chardet

    html = urllib2.urlopen("A website").read()  # fetch the raw page bytes
    print chardet.detect(html)                  # prints a dict like {'confidence': 0.99999, 'encoding': 'utf-8'}
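If that page-level detection can be trusted, one option (a sketch under that assumption, not necessarily what I did for every site) is to normalize the whole document to UTF-8 once, before handing it to lxml:

    encoding = chardet.detect(html)['encoding']
    if encoding and encoding.lower() != 'utf-8':
        # decode with the detected encoding, dropping undecodable bytes, then re-encode as UTF-8
        html = html.decode(encoding, 'ignore').encode('utf-8')

As the next paragraph shows, this alone did not save me, because individual fields pulled out by lxml could still come back in other encodings.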
Well, now the encoding of the whole HTML page is known, and it should go into the MySQL database as UTF-8. But when I inserted it, errors occurred: after extracting fields with lxml, some strings were no longer UTF-8 but Big5 (Traditional Chinese), or various unexpected encodings such as EUC-JP (Japanese). OK, so I took the unicode route: decode each field first, then encode it back to UTF-8.
    # detect the field's actual encoding and normalize it to UTF-8
    if chardet.detect(name)['encoding'] == 'GB2312':
        name = unicode(name, 'GB2312', 'ignore').encode('utf-8', 'ignore')
    elif chardet.detect(name)['encoding'] == 'Big5':
        name = unicode(name, 'Big5', 'ignore').encode('utf-8', 'ignore')
    elif chardet.detect(name)['encoding'] == 'ascii':
        name = unicode(name, 'ascii', 'ignore').encode('utf-8', 'ignore')
    elif chardet.detect(name)['encoding'] == 'GBK':
        name = unicode(name, 'GBK', 'ignore').encode('utf-8', 'ignore')
    elif chardet.detect(name)['encoding'] == 'EUC-JP':
        name = unicode(name, 'EUC-JP', 'ignore').encode('utf-8', 'ignore')
    else:
        name = 'Unknown'
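Rather than enumerating every encoding by hand, the same idea can be written more compactly by trusting whatever encoding chardet reports for the field. This is just a sketch, and to_utf8 is a hypothetical helper name; fields chardet cannot identify are marked 'Unknown', as above:

    import chardet

    def to_utf8(name):
        # hypothetical helper: detect the field's encoding and normalize it to UTF-8
        enc = chardet.detect(name)['encoding']
        if enc is None:
            return 'Unknown'  # chardet could not identify the encoding
        return unicode(name, enc, 'ignore').encode('utf-8', 'ignore')

    name = to_utf8(name)

On the MySQL side, passing charset='utf8' and use_unicode=True to the MySQLdb connection may also help avoid insert errors once the fields really are UTF-8 (an assumption about your driver setup, not something from the steps above).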
Thanks for reading, I hope this helps, and thanks for supporting this site!