我正在运行一个程序,它正在处理3万个类似的文件。随机数量的它们停止并产生此错误…

  File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
    data = pd.read_csv(filepath, names=fields)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
    return parser.read()
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
  File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
  File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
  File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
  File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
  File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
  File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid    continuation byte

这些文件的来源/创建都来自同一个地方。纠正这个问题以继续导入的最佳方法是什么?


当前回答

Pandas允许指定编码,但不允许忽略错误,不允许自动替换违规字节。因此,没有一种适合所有情况的方法,而是根据实际用例使用不同的方法。

You know the encoding, and there is no encoding error in the file. Great: you have just to specify the encoding: file_encoding = 'cp1252' # set file_encoding to the file encoding (utf8, latin1, etc.) pd.read_csv(input_file_and_path, ..., encoding=file_encoding) You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. Ok, you only have to use Latin1 encoding because it accept any possible byte as input (and convert it to the unicode character of same code): pd.read_csv(input_file_and_path, ..., encoding='latin1') You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real world example is an UTF8 file that has been edited with a non utf8 editor and which contains some lines with a different encoding. Pandas has no provision for a special error processing, but Python open function has (assuming Python3), and read_csv accepts a file like object. Typical errors parameter to use here are 'ignore' which just suppresses the offending bytes or (IMHO better) 'backslashreplace' which replaces the offending bytes by their Python’s backslashed escape sequence: file_encoding = 'utf8' # set file_encoding to the file encoding (utf8, latin1, etc.) input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace') pd.read_csv(input_fd, ...)

其他回答

我正在更新这个旧线程。我找到了一个有效的解决方案,但需要打开每个文件。我在LibreOffice中打开我的csv文件,选择另存为>编辑过滤器设置。在下拉菜单中,我选择UTF8编码。然后我在data = pd.read_csv(r' c:\fullpathtofile\filename.csv', sep = ',', encoding="utf-8-sig")中添加了encoding="utf-8-sig"。

希望这能帮助到一些人。

Pandas不会通过更改编码样式自动替换违规字节。在我的例子中,将编码参数从encoding = "utf-8"更改为encoding = "utf-16"解决了这个问题。

在我的例子中,这适用于python 2.7:

data = read_csv(filename, encoding = "ISO-8859-1", dtype={'name_of_colum': unicode}, low_memory=False) 

对于python3,只有:

data = read_csv(filename, encoding = "ISO-8859-1", low_memory=False) 
with open('filename.csv') as f:
   print(f)

执行这段代码后,你会发现'filename.csv'的编码,然后执行如下代码

data=pd.read_csv('filename.csv', encoding="encoding as you found earlier"

好了

Pandas允许指定编码,但不允许忽略错误,不允许自动替换违规字节。因此,没有一种适合所有情况的方法,而是根据实际用例使用不同的方法。

You know the encoding, and there is no encoding error in the file. Great: you have just to specify the encoding: file_encoding = 'cp1252' # set file_encoding to the file encoding (utf8, latin1, etc.) pd.read_csv(input_file_and_path, ..., encoding=file_encoding) You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. Ok, you only have to use Latin1 encoding because it accept any possible byte as input (and convert it to the unicode character of same code): pd.read_csv(input_file_and_path, ..., encoding='latin1') You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real world example is an UTF8 file that has been edited with a non utf8 editor and which contains some lines with a different encoding. Pandas has no provision for a special error processing, but Python open function has (assuming Python3), and read_csv accepts a file like object. Typical errors parameter to use here are 'ignore' which just suppresses the offending bytes or (IMHO better) 'backslashreplace' which replaces the offending bytes by their Python’s backslashed escape sequence: file_encoding = 'utf8' # set file_encoding to the file encoding (utf8, latin1, etc.) input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace') pd.read_csv(input_fd, ...)