我正在运行一个程序,它正在处理3万个类似的文件。随机数量的它们停止并产生此错误…

  File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
    data = pd.read_csv(filepath, names=fields)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
    return parser.read()
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
  File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
  File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
  File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
  File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
  File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
  File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid    continuation byte

这些文件的来源/创建都来自同一个地方。纠正这个问题以继续导入的最佳方法是什么?


当前回答

Pandas允许指定编码,但不允许忽略错误,不允许自动替换违规字节。因此,没有一种适合所有情况的方法,而是根据实际用例使用不同的方法。

You know the encoding, and there is no encoding error in the file. Great: you have just to specify the encoding: file_encoding = 'cp1252' # set file_encoding to the file encoding (utf8, latin1, etc.) pd.read_csv(input_file_and_path, ..., encoding=file_encoding) You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. Ok, you only have to use Latin1 encoding because it accept any possible byte as input (and convert it to the unicode character of same code): pd.read_csv(input_file_and_path, ..., encoding='latin1') You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real world example is an UTF8 file that has been edited with a non utf8 editor and which contains some lines with a different encoding. Pandas has no provision for a special error processing, but Python open function has (assuming Python3), and read_csv accepts a file like object. Typical errors parameter to use here are 'ignore' which just suppresses the offending bytes or (IMHO better) 'backslashreplace' which replaces the offending bytes by their Python’s backslashed escape sequence: file_encoding = 'utf8' # set file_encoding to the file encoding (utf8, latin1, etc.) input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace') pd.read_csv(input_fd, ...)

其他回答

我无法打开从网上银行下载的简体中文CSV文件。 我试过latin1,我试过iso-8859-1,我试过cp1252,都没有用。

但pd。Read_csv ("",encoding ='gbk')只是做这个工作。

我正在更新这个旧线程。我找到了一个有效的解决方案,但需要打开每个文件。我在LibreOffice中打开我的csv文件,选择另存为>编辑过滤器设置。在下拉菜单中,我选择UTF8编码。然后我在data = pd.read_csv(r' c:\fullpathtofile\filename.csv', sep = ',', encoding="utf-8-sig")中添加了encoding="utf-8-sig"。

希望这能帮助到一些人。

我正在使用Jupyter-notebook。在我的例子中,它以错误的格式显示文件。“编码”选项不起作用。 所以我将csv保存为utf-8格式,它可以工作。

Pandas不会通过更改编码样式自动替换违规字节。在我的例子中,将编码参数从encoding = "utf-8"更改为encoding = "utf-16"解决了这个问题。

有时问题只出现在.csv文件上。文件可能已损坏。 当面对这个问题时。“另存为”文件再次为csv格式。

0. Open the xls/csv file
1. Go to -> files 
2. Click -> Save As 
3. Write the file name 
4. Choose 'file type' as -> CSV [very important]
5. Click -> Ok