For example, to convert a string s from GBK to UTF-8, you can do the following:
s.decode('gbk').encode('utf-8')
However, in actual development, I've found that this approach often raises an exception:
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 30664-30665: illegal multibyte sequence
This is because illegal characters have been encountered. In particular, some programs written in C/C++ emit full-width spaces in nonstandard ways, such as \xa3\xa0 or \xa4\x57. These look like full-width spaces but are not "legal" ones (the real full-width space is \xa1\xa1), so an exception occurs during transcoding.
Problems like this are a headache, because as soon as an illegal character appears in a string, the entire string - and sometimes, the entire article - can't be transcoded.
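To see exactly which bytes trip the codec, one option is to catch the exception and inspect it. The sketch below is only an illustration: find_bad_bytes is a hypothetical helper, and the sample input appends a stray \xff byte to valid GBK text as an assumed stand-in for real corrupted data.

```python
def find_bad_bytes(raw, encoding='gbk'):
    """Return (start, end, offending_bytes) for the first decode error, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the offending bytes (end is exclusive)
        return exc.start, exc.end, exc.object[exc.start:exc.end]
    return None


# Two GBK-encoded Chinese characters followed by one stray byte (assumed sample)
sample = b'\xd6\xd0\xce\xc4\xff'
bad = find_bad_bytes(sample)
clean = find_bad_bytes(b'\xd6\xd0\xce\xc4')
```

Knowing the offsets makes it easier to decide whether the damage is a one-off or a systematic problem in whatever produced the data.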
Solution:
s.decode('gbk', 'ignore').encode('utf-8')
Because the signature of the decode function is decode([encoding], [errors='strict']), the second parameter controls the error-handling strategy. The default is strict, which means an exception is thrown when an illegal character is encountered;
If set to ignore, illegal characters are ignored;
If set to replace, illegal characters are replaced with a replacement marker (U+FFFD when decoding, ? when encoding);
If set to xmlcharrefreplace, an XML character reference is used (this handler applies only when encoding, not when decoding).
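The difference between the strategies is easiest to see side by side. This is a minimal sketch on an assumed sample: valid GBK text with one stray \xff byte appended. The variable names are illustrative only.

```python
# Two GBK-encoded Chinese characters plus one stray byte (assumed sample)
raw = b'\xd6\xd0\xce\xc4\xff'

try:
    raw.decode('gbk')                 # errors='strict' is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode('gbk', 'ignore')     # bad byte is silently dropped
replaced = raw.decode('gbk', 'replace')   # bad byte becomes U+FFFD
utf8_bytes = ignored.encode('utf-8')      # the cleaned text re-encodes without error
```

Note that ignore discards data silently, while replace leaves a visible marker where something was lost; which one is preferable depends on whether you need to find the damage later.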
Python documentation:
decode( [encoding[, errors]])
Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding. errors may be given to set a different error handling scheme. The default is 'strict', meaning that encoding errors raise UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error, see section 4.8.1.
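The documentation also mentions codecs.register_error for custom handlers. As a rough sketch of how that could be used here, the handler below substitutes a real full-width space (U+3000) for any undecodable bytes. The handler name 'fullwidth_space' and the sample input are assumptions made for illustration, and how many bytes each reported error covers depends on the codec implementation.

```python
import codecs


def fullwidth_space_handler(exc):
    """Replace undecodable bytes with U+3000 and resume after them."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc                      # this sketch only handles decode errors
    # Return the replacement text and the position at which to resume decoding
    return (u'\u3000', exc.end)


# Register under a hypothetical name so it can be passed as the errors argument
codecs.register_error('fullwidth_space', fullwidth_space_handler)

# Two GBK-encoded Chinese characters plus one stray byte (assumed sample)
raw = b'\xd6\xd0\xce\xc4\xff'
decoded = raw.decode('gbk', 'fullwidth_space')
```

A handler like this guesses that every bad sequence was meant to be a space, so it is only appropriate when you know that is how the data was corrupted.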