在Pandas with Python中读取CSV文件时出现UnicodeDecodeError

我正在运行一个程序，它正在处理3万个类似的文件。随机数量的它们停止并产生此错误…

  File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
    data = pd.read_csv(filepath, names=fields)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
    return parser.read()
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
  File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
  File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
  File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
  File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
  File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
  File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid    continuation byte

这些文件的来源/创建都来自同一个地方。纠正这个问题以继续导入的最佳方法是什么?

当前回答

I am posting an answer to provide an updated solution and explanation as to why this problem can occur. Say you are getting this data from a database or Excel workbook. If you have special characters like La Cañada Flintridge city, well unless you are exporting the data using UTF-8 encoding, you're going to introduce errors. La Cañada Flintridge city will become La Ca\xf1ada Flintridge city. If you are using pandas.read_csv without any adjustments to the default parameters, you'll hit the following error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte

幸运的是，有一些解决方案。

选项1，修复导出。确保使用UTF-8编码。

选项2，如果您无法修复导出问题，并且需要使用pandas。Read_csv，请确保包含以下参数，engine='python'。默认情况下，pandas使用engine='C'，这非常适合读取大的干净文件，但如果出现任何意外情况，则会崩溃。根据我的经验，设置encoding='utf-8'从来没有修复过这个UnicodeDecodeError。此外，您不需要使用errors_bad_lines，但是，如果您确实需要它，这仍然是一个选项。

pd.read_csv(<your file>, engine='python')

选项3:解决方案是我个人更喜欢的解决方案。使用普通Python读取文件。

import pandas as pd

data = []

with open(<your file>, "rb") as myfile:
    # read the header seperately
    # decode it as 'utf-8', remove any special characters, and split it on the comma (or deliminator)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns = header)

希望这对第一次遇到这个问题的人有所帮助。

2019-10-22 17:12:38

其他回答

在我的例子中，这适用于python 2.7:

data = read_csv(filename, encoding = "ISO-8859-1", dtype={'name_of_colum': unicode}, low_memory=False)

对于python3，只有:

data = read_csv(filename, encoding = "ISO-8859-1", low_memory=False)

2019-07-26 15:49:38

Pandas允许指定编码，但不允许忽略错误，不允许自动替换违规字节。因此，没有一种适合所有情况的方法，而是根据实际用例使用不同的方法。

You know the encoding, and there is no encoding error in the file. Great: you have just to specify the encoding: file_encoding = 'cp1252' # set file_encoding to the file encoding (utf8, latin1, etc.) pd.read_csv(input_file_and_path, ..., encoding=file_encoding) You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. Ok, you only have to use Latin1 encoding because it accept any possible byte as input (and convert it to the unicode character of same code): pd.read_csv(input_file_and_path, ..., encoding='latin1') You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real world example is an UTF8 file that has been edited with a non utf8 editor and which contains some lines with a different encoding. Pandas has no provision for a special error processing, but Python open function has (assuming Python3), and read_csv accepts a file like object. Typical errors parameter to use here are 'ignore' which just suppresses the offending bytes or (IMHO better) 'backslashreplace' which replaces the offending bytes by their Python’s backslashed escape sequence: file_encoding = 'utf8' # set file_encoding to the file encoding (utf8, latin1, etc.) input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace') pd.read_csv(input_fd, ...)

2018-08-09 09:42:15

这个问题困扰了我一段时间，我想我应该发布这个问题，因为它是第一个搜索结果。将encoding="iso-8859-1"标签添加到pandas read_csv中不起作用，任何其他编码也不起作用，一直给出UnicodeDecodeError。

如果将文件句柄传递给pd.read_csv()，则需要将encoding属性放在打开的文件上，而不是在read_csv中。事后看来很明显，但这是一个需要追查的微妙错误。

2018-05-13 17:12:35

尝试指定engine='python'。它对我很有效，但我仍在试图弄清楚为什么。

df = pd.read_csv(input_file_path,...engine='python')

2019-01-31 02:52:12

你可以试试这个。

import csv
import pandas as pd
df = pd.read_csv(filepath,encoding='unicode_escape')

2020-06-14 09:33:29

在Pandas with Python中读取CSV文件时出现UnicodeDecodeError

推荐文章

最新文章

标签