UnicodeDecodeError when reading a CSV file in Pandas with Python

I'm running a program that processes 30,000 similar files. A random number of them stop and produce this error...

  File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
    data = pd.read_csv(filepath, names=fields)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
    return parser.read()
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
  File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
  File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
  File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
  File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
  File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
  File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid continuation byte

These files all come from, and were created in, the same place. What is the best way to correct this so the import can proceed?

Current answer

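In my case, none of the previously suggested methods got me past this problem. Changing the encoding to utf-8, utf-16, iso-8859-1, or anything else did not work.

But instead of using pd.read_csv(filename, delimiter=';'), I used:

pd.read_csv(open(filename, 'r'), delimiter=';')

and everything seems to work fine.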
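A possible explanation for why this works, offered as a sketch rather than a confirmed diagnosis: when open() is called in text mode without an encoding argument, Python decodes the file with the locale's preferred encoding (often cp1252 on Windows), so the bytes are decoded by the file object before pandas sees them. The helper open_text below is hypothetical and only makes that choice explicit. The cp1252 in the usage comment is an assumption about the files, not something stated in this answer.

```python
import locale

# Encoding that open(filename, 'r') falls back to when none is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)  # varies by machine, e.g. 'cp1252' on many Windows setups


def open_text(path, encoding=None):
    """Hypothetical helper: open a CSV in text mode with an explicit encoding.

    Falls back to the locale default, which is what open(path, 'r') uses.
    """
    return open(path, "r", encoding=encoding or default_encoding)


# Usage with pandas (not executed here; 'cp1252' is an assumed encoding):
#   with open_text(filename, "cp1252") as f:
#       df = pd.read_csv(f, delimiter=";")
```

Using a with block also closes the file afterwards, which the one-liner above does not do.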

2021-12-28 08:26:37

Other answers

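This answer seems to be the catch-all for CSV encoding issues. If you are getting a strange encoding problem with your header, like this:

>>> from csv import DictReader
>>> f = open(filename,"r")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('\ufeffid', '1'), ... ])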

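then you have a byte order mark (BOM) character at the beginning of your CSV file. This answer addresses the issue:

Python read csv - BOM embedded into the first key

The solution is to load the CSV with encoding="utf-8-sig":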

>>> f = open(filename,"r", encoding="utf-8-sig")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('id', '1'), ... ])
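To try this without a file on disk, here is a small self-contained sketch using in-memory bytes. The function read_first_row and the sample data are made up for illustration. It also shows that utf-8-sig reads a file without a BOM just as well, so it is a reasonably safe default for UTF-8 input:

```python
import csv
import io


def read_first_row(raw_bytes, encoding):
    """Decode raw CSV bytes and return the first data row as a plain dict."""
    text = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding=encoding, newline="")
    return dict(next(csv.DictReader(text)))


with_bom = b"\xef\xbb\xbfid,name\r\n1,alpha\r\n"  # starts with the UTF-8 BOM
without_bom = b"id,name\r\n1,alpha\r\n"

print(read_first_row(with_bom, "utf-8"))         # first key still carries '\ufeff'
print(read_first_row(with_bom, "utf-8-sig"))     # BOM stripped from the first key
print(read_first_row(without_bom, "utf-8-sig"))  # no BOM present, same result
```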

Hope this helps someone.

2018-12-17 18:14:28

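Pandas allows you to specify the encoding, but does not allow you to ignore errors or to automatically replace the offending bytes. So there is no one-size-fits-all method, only different approaches depending on the actual use case.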

You know the encoding, and there is no encoding error in the file. Great: you just have to specify the encoding:

file_encoding = 'cp1252'  # set file_encoding to the file encoding (utf8, latin1, etc.)
pd.read_csv(input_file_and_path, ..., encoding=file_encoding)

You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. OK, you only have to use Latin1 encoding, because it accepts any possible byte as input (and converts it to the Unicode character with the same code):

pd.read_csv(input_file_and_path, ..., encoding='latin1')

You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real-world example is a UTF-8 file that has been edited with a non-UTF-8 editor and contains some lines in a different encoding. Pandas has no provision for special error processing, but Python's open function has (assuming Python 3), and read_csv accepts a file-like object. Typical values for the errors parameter here are 'ignore', which just suppresses the offending bytes, or (IMHO better) 'backslashreplace', which replaces the offending bytes with their Python backslashed escape sequence:

file_encoding = 'utf8'  # set file_encoding to the file encoding (utf8, latin1, etc.)
input_fd = open(input_file_and_path, encoding=file_encoding, errors='backslashreplace')
pd.read_csv(input_fd, ...)
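To see what each choice does to the data, here is a minimal sketch on in-memory bytes. The sample content is invented, but it contains the same stray byte 0xda as the traceback in the question. It uses only bytes.decode, which accepts the same encoding and errors arguments as open():

```python
# Mostly UTF-8 sample with one stray byte (0xda), as in the question's error.
raw = b"id,name\n1,Mu\xdaoz\n2,Zo\xc3\xab\n"

# Strict UTF-8 decoding fails on the stray byte.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc.reason

# latin1 never fails, but valid UTF-8 sequences turn into mojibake.
as_latin1 = raw.decode("latin1")

# UTF-8 with an error handler keeps the valid text and handles only bad bytes.
as_escaped = raw.decode("utf-8", errors="backslashreplace")
as_ignored = raw.decode("utf-8", errors="ignore")

print(strict_error)
print(repr(as_latin1))
print(repr(as_escaped))
print(repr(as_ignored))
```

With 'backslashreplace' the bad byte stays visible in the loaded data as the text \xda, so affected rows can be found later. With 'ignore' it disappears silently. As far as I know, newer pandas releases (1.3 and later) also accept an encoding_errors argument on read_csv that takes the same handler names. Check the documentation for your installed version before relying on it.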

2018-08-09 09:42:15

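I am updating this old thread. I found one solution that worked, but it requires opening each file. I opened my CSV file in LibreOffice and chose Save As > Edit filter settings. In the drop-down menu I chose UTF8 encoding. Then I added encoding="utf-8-sig" to data = pd.read_csv(r'C:\fullpathtofile\filename.csv', sep=',', encoding="utf-8-sig").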

Hope this helps someone.

2019-01-01 00:54:23

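This bothered me for a while, and I thought I'd post on this question since it is the first search result. Adding the encoding="iso-8859-1" argument to pandas read_csv didn't work, nor did any other encoding; it kept giving a UnicodeDecodeError.

If you're passing a file handle to pd.read_csv(), you need to put the encoding attribute on the file open, not in read_csv. Obvious in hindsight, but a subtle error to track down.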
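The reason, as far as I understand it, is that a text-mode handle decodes the bytes itself, using whatever encoding it was opened with. A minimal sketch of that behaviour, with io.TextIOWrapper standing in for open(path, encoding=...); the function read_text and the sample data are made up for illustration:

```python
import io

# Sample bytes in iso-8859-1; the letter ñ is the single byte 0xf1.
raw = "name\nMuñoz\n".encode("iso-8859-1")


def read_text(raw_bytes, encoding):
    """Stand-in for open(path, encoding=...): decoding happens in the handle."""
    handle = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding=encoding)
    return handle.read()


# A handle opened as UTF-8 fails while reading, whatever the consumer asks for.
try:
    read_text(raw, "utf-8")
    failed_as_utf8 = False
except UnicodeDecodeError:
    failed_as_utf8 = True

# A handle opened with the right encoding yields clean text.
decoded = read_text(raw, "iso-8859-1")

print(failed_as_utf8)
print(repr(decoded))
```

So with a handle, the call would look like pd.read_csv(open(filename, encoding="iso-8859-1")), with the encoding given to open rather than to read_csv.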

2018-05-13 17:12:35

Please try adding

import pandas as pd
df = pd.read_csv('file.csv', encoding='unicode_escape')

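This will help. It worked for me. Also, make sure you are using the correct delimiter and column names.

To load the file quickly, you can start by loading only 1000 rows.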
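One caveat worth knowing, shown as a sketch on invented sample bytes: the unicode_escape codec decodes bytes as Latin-1 and also interprets backslash escape sequences. It will not raise on bytes like 0xda, but any literal backslashes in the data (Windows paths, for example) get rewritten:

```python
# Invented sample: a Windows-style path with literal backslashes plus byte 0xda.
raw = b"path,name\nC:\\new\\table,Mu\xdaoz\n"

# unicode_escape turns the literal "\n" and "\t" in the path into control characters.
as_unicode_escape = raw.decode("unicode_escape")

# latin1 maps each byte to one character and leaves backslashes alone.
as_latin1 = raw.decode("latin1")

print(repr(as_unicode_escape))
print(repr(as_latin1))
```

If the data may contain backslashes, encoding='latin1' is probably the safer way to get the same "never fails" behaviour. For the quick first look mentioned above, read_csv takes an nrows argument, for example pd.read_csv('file.csv', encoding='latin1', nrows=1000).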

2020-05-21 05:39:51
