Why does the following code fail, and why does it succeed with the "latin-1" codec?

o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")

The result is:

 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
 UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte

Current answer

If you get this error while working with a file you have just opened, check whether you opened it in 'rb' mode.
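
A minimal sketch of that suggestion, assuming the file name and the latin-1 fallback are just placeholders: open the file in binary mode and decode the raw bytes explicitly.

with open('data.txt', 'rb') as f:   # binary mode: read raw bytes, not text
    raw = f.read()

try:
    text = raw.decode('utf-8')
except UnicodeDecodeError:
    text = raw.decode('latin-1')    # fallback only if the bytes are not valid UTF-8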

Other answers

I ran into this problem, and it turned out I had saved my CSV directly from a Google Sheets file. In other words, I was in a Google Sheets document, chose to save a copy, and when my browser downloaded it I chose to open it and then saved the CSV directly from there. That was the wrong step.

What fixed it for me was to first save the sheet locally as an .xlsx file and then export that sheet to a .csv file. After that, the error from pd.read_csv('myfile.csv') went away.

TL;DR: I recommend investigating the root cause of the problem before switching encodings just to silence the error.

I got this error because I was processing a large number of zip files that contained further zip files inside them.

My workflow was as follows:

Read zip
Read child zip
Read text in child zip

At some point I was hitting the encoding error above. Upon closer inspection, it turned out that some child zips erroneously contained further zips. Reading these zips as text led to some funky character representation that I could silence with encoding="latin-1", but which in turn caused issues further down the line. Since I was working with international data it was not completely foolish to assume it was an encoding problem (I had problems with 0xc2: Â), but in the end it was not the actual issue.
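
A rough sketch of that workflow with a guard against the nested-zip case; the file name 'outer.zip' and the exact handling are assumptions for illustration, not the original code:

import io
import zipfile

with zipfile.ZipFile('outer.zip') as outer:
    for name in outer.namelist():
        data = outer.read(name)
        if zipfile.is_zipfile(io.BytesIO(data)):
            # The member is itself a zip archive: open it as one instead of
            # decoding its raw bytes as text.
            with zipfile.ZipFile(io.BytesIO(data)) as child:
                print('%s is a nested zip with %d members' % (name, len(child.namelist())))
        else:
            text = data.decode('utf-8')  # decode only genuine text members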

I got the same error when I tried to open a CSV file with the pandas.read_csv method.

The solution was to change the encoding to latin-1:

pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, encoding='latin-1')
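
Before settling on latin-1, it can be worth confirming what encoding the file actually uses. A sketch using the third-party chardet package (the sample size and the example output are assumptions):

import chardet  # third-party package: pip install chardet

with open('ml-100k/u.item', 'rb') as f:
    raw = f.read(100000)  # a sample of the raw bytes is usually enough

print(chardet.detect(raw))  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}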

The solution was to change the file's encoding to "UTF-8 sin BOM" (UTF-8 without BOM).
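
If you need to do that conversion in code rather than in an editor, here is a rough sketch (the file name is a placeholder) that strips a UTF-8 byte-order mark if one is present:

import codecs

with open('input.sql', 'rb') as f:
    raw = f.read()

if raw.startswith(codecs.BOM_UTF8):
    raw = raw[len(codecs.BOM_UTF8):]  # drop the UTF-8 BOM (EF BB BF)

with open('input.sql', 'wb') as f:
    f.write(raw)

Alternatively, decoding with Python's 'utf-8-sig' codec skips a leading BOM automatically.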

That is because UTF-8 is a multi-byte encoding, and there is no character corresponding to your combination of \xe9 plus the following space.
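
To see this at the byte level (a quick illustration, not part of the original answer): 0xE9 has the bit pattern of a lead byte announcing a three-byte UTF-8 sequence, so the decoder expects the next byte to look like 10xxxxxx, and the space that follows does not.

>>> bin(0xe9)        # 1110 1001: lead byte of a three-byte UTF-8 sequence
'0b11101001'
>>> bin(ord(' '))    # 0010 0000: not of the form 10xxxxxx, so not a valid continuation byte
'0b100000'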

Why should it succeed in both utf-8 and latin-1?

Here is how the same sentence should look in utf-8:

>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'
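
For completeness, a quick note on why the latin-1 decode never raises: latin-1 maps every byte value 0-255 directly to the Unicode code point with the same number, so \xe9 simply becomes U+00E9 (é). Shown here with the Python 2 repr to match the question:

>>> o.decode('latin-1')
u'a test of \xe9 char'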