Why does the code below fail? And why does it succeed with the "latin-1" codec?
o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")
The result is:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte
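TLDR: I recommend investigating the root cause of the problem in depth before switching codecs to make the error go away.
I got this error because I was processing a large number of zip files that had further zip files inside them.
My workflow was as follows:
Read the zip
Read the child zip
Read the text from the child zip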
At some point I was hitting the encoding error above. On closer inspection, it turned out that some child zips erroneously contained further zips. Reading these zips as text led to some funky character representations that I could silence with encoding="latin-1", but which in turn caused issues further down the line. Since I was working with international data, it was not completely foolish to assume an encoding problem (I had problems with 0xc2: Â), but in the end that was not the actual issue.
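For what it's worth, here is a minimal sketch of how that kind of check could look, assuming Python 3 and that the archives fit in memory. The function name read_text_members and the default encoding are my own placeholders, not anything from the original workflow. It tests each member with zipfile.is_zipfile before trying to decode it, so a nested zip is opened as a zip instead of being read as text.

```python
import io
import zipfile


def read_text_members(zip_bytes, encoding="utf-8"):
    """Yield (name, text) for every non-zip member, recursing into nested zips.

    Hypothetical helper: assumes every non-zip member is text in `encoding`.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            data = archive.read(name)
            if zipfile.is_zipfile(io.BytesIO(data)):
                # The member is itself a zip: descend instead of decoding it.
                for item in read_text_members(data, encoding):
                    yield item
            else:
                yield name, data.decode(encoding)
```

If a member still fails to decode after this check, that is a better hint of a genuine encoding problem than an error raised while reading archive bytes as text.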
In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you'll see that such a byte must be followed by two bytes of the form 10xx xxxx. For example:
>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'
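To make the bit patterns concrete, here is a rough sketch that classifies a byte by its leading bits. The function names are made up for illustration, and it only looks at the bit pattern: it does not reject lead bytes such as 0xC0 or 0xF5 that a real UTF-8 decoder also refuses.

```python
def expected_continuation_bytes(lead_byte):
    """Return how many 10xxxxxx bytes the UTF-8 bit pattern of lead_byte calls for.

    Returns None if the byte does not look like a lead byte at all.
    """
    if (lead_byte & 0b10000000) == 0b00000000:  # 0xxxxxxx: plain ASCII
        return 0
    if (lead_byte & 0b11100000) == 0b11000000:  # 110xxxxx: 2-byte sequence
        return 1
    if (lead_byte & 0b11110000) == 0b11100000:  # 1110xxxx: 3-byte sequence
        return 2
    if (lead_byte & 0b11111000) == 0b11110000:  # 11110xxx: 4-byte sequence
        return 3
    return None  # 10xxxxxx (a continuation byte) or 11111xxx


def is_continuation(byte):
    """True if byte has the form 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000
```

In the string from the question, 0xE9 asks for two continuation bytes, but the next byte is a space (0x20), which is not of the form 10xx xxxx. That is where "invalid continuation byte" comes from.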
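But that's just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in Latin-1. You can see how UTF-8 and Latin-1 look different: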
>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'
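(Note: I'm using a mix of Python 2 and Python 3 notation here. The input is valid in either version of Python, but your Python interpreter is unlikely to actually display both unicode and byte strings this way.)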
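As a small sketch of how you might inspect such a failure before picking a codec, assuming Python 3 and a bytes input: the helper name locate_decode_failure is hypothetical. It reports where decoding breaks, using the start attribute of UnicodeDecodeError. Keep in mind that Latin-1 maps every possible byte to a character, so decoding with it never raises. That is why it "succeeds", and also why it can hide the fact that the data is something else entirely.

```python
def locate_decode_failure(raw, encoding="utf-8"):
    """Return (position, byte value) of the first undecodable byte, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte the codec could not decode.
        return exc.start, bytearray(raw)[exc.start]
    return None


raw = b"a test of \xe9 char"          # the bytes from the question
failure = locate_decode_failure(raw)   # where UTF-8 decoding breaks
as_latin1 = raw.decode("latin-1")      # never raises: every byte is valid Latin-1
```

If the data really is Latin-1, re-encoding as_latin1 with encode("utf-8") turns the single byte 0xE9 into the two bytes 0xC3 0xA9.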