Updated on 2024-11-14

A summary of ways to handle Chinese text in Python 2

Python 2 does not use Unicode as its basic string type, so garbled text is far more likely than in Python 3. Even so, I suspect many people are not ready to migrate to Python 3 on a whim, so here is a summary of the problems I run into most often and how I solve them.

1. Chinese comments cannot be used in the source file

Solution:

Add # -*- coding=UTF-8 -*- to the code. It normally goes on the first line of the file, or on the second line if the first line is a shebang line (in which case it is still effectively the first line of the Python code itself).

Then save the file as UTF-8.

This fixes Chinese in comments, and also in string literals that contain Chinese.
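
As a minimal sketch of the declaration in use (the name greeting and the comment text are made up for illustration), a file laid out like this should load cleanly once it is saved as UTF-8. It uses the colon form of the declaration; both the colon form and the = form are accepted.

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 这是一行中文注释 (a Chinese comment, allowed because of the declaration above)

# A string literal containing Chinese; the u prefix makes it a Unicode string.
greeting = u"你好，世界"

# len() counts characters for a Unicode string: five here, including the
# full-width comma.
greeting_length = len(greeting)
```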

2. Printing a unicode variable that contains Chinese produces garbled output

Solution:

Add the following three lines near the top of the file, where the libraries are imported.

import sys
reload(sys)  # brings back sys.setdefaultencoding, which is removed at startup
sys.setdefaultencoding('utf-8')

3. Converting between UTF-8 and GBK

The code speaks for itself:

# utf8_str and gbk_str are placeholder names for your own byte strings
# Convert a UTF-8 string to GBK (GB2312 and other encodings work the same way)
print utf8_str.decode('UTF-8').encode('GBK')
# Convert a GBK string to UTF-8
print gbk_str.decode('GBK').encode('UTF-8')
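
The conversion always goes through Unicode: decode the bytes from the source encoding, then encode them into the target encoding. The sketch below wraps that in a helper; the function name transcode and the sample text are assumptions made for illustration, not part of any library.

```python
# -*- coding: utf-8 -*-

def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string by decoding to Unicode first (hypothetical helper)."""
    return data.decode(source_encoding).encode(target_encoding)

# Start from UTF-8 bytes: each of the two characters takes three bytes.
utf8_bytes = u"中文".encode("utf-8")

# In GBK each of these characters takes two bytes.
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")

# Converting back should give the original UTF-8 bytes again.
round_trip = transcode(gbk_bytes, "gbk", "utf-8")
```

Decoding with the wrong source encoding is the usual cause of garbled results here, so it helps to know what encoding the input really is before converting.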

4. Should utf-8 be written in upper or lower case in the parameter?

Either case generally works. Python normalizes encoding names when it looks up a codec, so 'utf-8' and 'UTF-8' are treated as the same encoding.
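
A quick way to check how Python itself treats the case of an encoding name is to look the codec up both ways. This is only a small sketch; the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
import codecs

# Look up the same codec under both spellings.
upper = codecs.lookup("UTF-8")
lower = codecs.lookup("utf-8")
same_codec = upper.name == lower.name

# Encoding with either spelling should give identical bytes.
encoded_upper = u"中文".encode("UTF-8")
encoded_lower = u"中文".encode("utf-8")
```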

5. Opening a UTF-8 text file

With the settings from 1 and 2 in place, a plain open usually works: whatever encoding the file is in is the encoding of what you read back. In the few cases where that still fails, use the codecs module:

import codecs
...
with codecs.open(poetry_file, "r", "utf-8") as f:
    content = f.read()  # content is a Unicode string
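
For a complete round trip, the sketch below writes a small UTF-8 file and reads it back with codecs.open. The file name poetry.txt, the temporary directory, and the sample lines are assumptions chosen for illustration.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Create a throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
poetry_file = os.path.join(tmp_dir, "poetry.txt")

# Writing through codecs.open encodes Unicode text to UTF-8 on the way out.
with codecs.open(poetry_file, "w", "utf-8") as f:
    f.write(u"床前明月光\n疑是地上霜\n")

# Reading through codecs.open decodes the bytes back into Unicode strings.
with codecs.open(poetry_file, "r", "utf-8") as f:
    lines = [line.rstrip(u"\n") for line in f]

# Clean up the temporary file and directory.
os.remove(poetry_file)
os.rmdir(tmp_dir)
```

Because the decoding happens inside codecs.open, the rest of the program only ever sees Unicode strings and does not have to care how the file was stored.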

6. print shows garbled Chinese when printing a data structure

Printing a single UTF-8 variable on its own is not a problem, for example:

a = "汉字"
print a
# displays correctly

But if it is printed as part of a sequence, for example:

print a,
# displays garbled text

Any other structure, such as a dict, list, or class instance, will also come out garbled.

a = ["中文", "测试"]
print a
# displays escaped bytes such as '\xe4\xb8\xad\xe6\x96\x87' instead of the characters

The standard library offers no good way around this, short of looping over the contents and printing them one by one, for example:

...
for item in items:
    print item

Or join the items into a single string for output, for example: print ', '.join(a)
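
The join approach can be pulled into small helpers so that lists and dicts are both turned into one readable string before printing. This is only a sketch: format_items and format_mapping are hypothetical names, and the sample data is made up.

```python
# -*- coding: utf-8 -*-

def format_items(items):
    """Join a list of Unicode strings into one printable string."""
    return u", ".join(items)

def format_mapping(mapping):
    """Render a dict as 'key: value' pairs, sorted by key for stable output."""
    return u", ".join(u"%s: %s" % (key, mapping[key]) for key in sorted(mapping))

fruits = [u"中文", u"测试"]
colors = {u"苹果": u"红", u"香蕉": u"黄"}

items_text = format_items(fruits)
mapping_text = format_mapping(colors)

# Printing the joined string shows the characters themselves rather than
# the escaped form that printing the container directly would give.
print(items_text)
print(mapping_text)
```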

It is also possible to use third-party packages, for example:

import uniout
...
listnine = ['梨', '橘子', '苹果', '香蕉']
print 'listnine list: %s' % listnine

7. The variable itself displays correctly, but the individual characters produced by looping over it are garbled

Most of the time this is because the string is not a Unicode string, so the loop walks over single bytes rather than whole characters. Declare the string as a = u'汉字'; variables assigned this way are Unicode strings and will not have this problem.

If the variable is passed in from outside and you do not know whether it is already Unicode, try converting it to a Unicode string:

text = unicode(text, "utf-8")  # named text to avoid shadowing the built-in str
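
When the input may or may not already be Unicode, a small guard avoids converting twice (in Python 2, calling unicode() with an encoding on a value that is already Unicode raises a TypeError). The sketch below assumes the incoming bytes are UTF-8; to_text is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Return a Unicode string, decoding only when given a byte string."""
    if isinstance(value, bytes):  # bytes is the same type as str in Python 2
        return value.decode(encoding)
    return value

# Bytes as they might arrive from a file or socket: six bytes, two characters.
raw = u"中文".encode("utf-8")

# Looping over the converted text yields whole characters, not single bytes.
chars = [ch for ch in to_text(raw)]

# Text that is already Unicode passes through unchanged.
already_text = to_text(u"中文")
```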

Well, that's pretty much it. I'll add more as I think of it.
