UnicodeDecodeError: 'utf-8' codec can't decode byte

Here is my code:

for line in open('u.item'):
    pass  # Read each line

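Whenever I run this code, it gives the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte

I tried to fix this by adding an extra parameter to open(). The code looks like this: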

for line in open('u.item', encoding='utf-8'):
    pass  # Read each line

But it gives the same error again. What should I do?

Current answer

You can work around the problem like this:

for line in open(your_file_path, 'rb'):

'rb' opens the file in binary mode, so each line is read as a bytes object and no decoding is attempted.
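As a rough sketch of what binary mode implies: the error goes away because nothing is decoded, but each line is then bytes, and turning it into text still needs an explicit encoding. The sample data and the choice of cp1252 below are assumptions made for illustration, not something the question specifies.

```python
import os
import tempfile

# Hypothetical sample data: "café" encoded as Windows-1252, which is not valid UTF-8.
raw = "café\nplain ascii\n".encode("cp1252")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

decoded_lines = []
with open(path, "rb") as f:      # binary mode: no decoding happens here
    for line in f:               # each line is a bytes object
        # Decoding is now an explicit step; cp1252 is an assumption about this data.
        decoded_lines.append(line.decode("cp1252").rstrip("\n"))

os.remove(path)
print(decoded_lines)
```

If the real encoding is unknown, line.decode('utf-8', errors='replace') is another option; it never raises, but undecodable bytes become U+FFFD replacement characters.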

2019-05-02 02:15:15

Other answers

I keep running into this error, and often the fix is not encoding='utf-8' but engine='python', like this:

import pandas as pd

file = "c:\\path\\to_my\\file.csv"
df = pd.read_csv(file, engine='python')
df

The documentation is here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

2022-06-09 07:08:09

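I was working with a dataset downloaded from Kaggle, and reading it raised this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 183: invalid continuation byte

This is how I fixed it.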

import pandas as pd

pd.read_csv('top50.csv', encoding='ISO-8859-1')
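One caveat, sketched below with plain Python rather than pandas: ISO-8859-1 assigns a character to every possible byte value, so decoding with it never raises. That makes the error disappear, but it does not prove the guess was right. The sample strings here are made up for illustration.

```python
# Every possible byte value decodes under ISO-8859-1 (latin-1) ...
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")
print(len(text))

# ... so the error goes away even when the guess is wrong.
utf8_bytes = "é".encode("utf-8")              # two bytes: 0xc3 0xa9
wrong_guess = utf8_bytes.decode("iso-8859-1")
print(repr(wrong_guess))                      # two characters instead of one
```

If the loaded data shows sequences like 'Ã©' where accented letters are expected, the file was probably UTF-8 after all and the failure had another cause.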

2021-10-13 12:46:05

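Your file doesn't actually contain UTF-8 encoded data; it uses some other encoding. Figure out what that encoding is and pass it to the open() call.

For example, in the Windows-1252 encoding, 0xe9 is the character é.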
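The point can be checked directly in Python. The byte string below is a made-up example containing 0xe9; it is not taken from u.item.

```python
data = b"caf\xe9 au lait"   # hypothetical bytes containing 0xe9

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_error = exc        # keep a reference; 'exc' is unbound after the except block
    print(exc.reason, "at position", exc.start)

# The same bytes decode cleanly as Windows-1252.
as_cp1252 = data.decode("cp1252")
print(as_cp1252)
```

In UTF-8, 0xe9 announces a three-byte sequence, so the space that follows it is rejected as an invalid continuation byte. Once the real encoding is known, it goes into the call as open('u.item', encoding='cp1252').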

2013-10-31 05:58:23

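Based on another Stack Overflow question and the earlier answers in this thread, I'd like to add a helper for finding the right encoding.

If your script runs on Linux, you can get the encoding with the file command: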

file --mime-encoding <filename>

Here is a Python script that does this for you:

import sys
import subprocess

if len(sys.argv) < 2:
    print("Usage: {} <filename>".format(sys.argv[0]))
    sys.exit(1)

def find_encoding(fname):
    """Find the encoding of a file using file command
    """

    # find fullname of file command
    which_run = subprocess.run(['which', 'file'], stdout=subprocess.PIPE)
    if which_run.returncode != 0:
        print("Unable to find 'file' command ({})".format(which_run.returncode))
        return None

    file_cmd = which_run.stdout.decode().replace('\n', '')

    # run file command to get MIME encoding
    file_run = subprocess.run([file_cmd, '--mime-encoding', fname],
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
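    if file_run.returncode != 0:
        print(file_run.stderr.decode(), file=sys.stderr)
        return None

    # return encoding name only; output looks like "<fname>: <encoding>",
    # so take the last token in case the file name contains spaces
    return file_run.stdout.decode().split()[-1]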

# test
print("Encoding of {}: {}".format(sys.argv[1], find_encoding(sys.argv[1])))
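Where the file command is not available (on Windows, for instance), a cruder fallback is to try a short list of candidate encodings in order. This is only a sketch: guess_encoding and its candidate list are hypothetical names chosen for illustration, and a successful decode is not proof that the encoding is correct.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252", "iso-8859-1")):
    """Return the first candidate that decodes `raw` without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue            # this candidate cannot represent the bytes
        return name
    return None


print(guess_encoding("é".encode("utf-8")))   # valid UTF-8 is accepted first
print(guess_encoding(b"caf\xe9"))            # not UTF-8, falls through to cp1252
```

The order matters: UTF-8 is strict and goes first, while ISO-8859-1 accepts any byte sequence and therefore has to come last. For a file, pass open(path, 'rb').read() as raw.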

2021-08-30 05:19:54

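My problem was similar: text that was expected to be UTF-8 was being passed to the Python script.

In my case, it came from SQL using sp_execute_external_script in SQL Server Machine Learning Services. For whatever reason, VARCHAR data appears to be passed as UTF-8, whereas NVARCHAR data is passed as UTF-16.

Since there is no way to specify the default encoding in Python, and there is no user-editable Python statement that parses the data, I had to use the SQL CONVERT() function in the SELECT query in the @input_data parameter.

So, this query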

EXEC sp_execute_external_script @language = N'Python', 
@script = N'
OutputDataSet = InputDataSet
', 
@input_data_1 = N'SELECT id, text FROM the_error;'
WITH RESULT SETS (([id] int, [text] nvarchar(max)));

gives the error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 0: unexpected end of data

while using CONVERT(type, data) (CAST(data AS type) would also work), the query succeeds:

EXEC sp_execute_external_script @language = N'Python', 
@script = N'
OutputDataSet = InputDataSet
', 
@input_data_1 = N'SELECT id, CONVERT(NVARCHAR(max), text) FROM the_error;'
WITH RESULT SETS (([id] INT, [text] NVARCHAR(max)));

id  text
1   Ç
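The error message above can be reproduced in plain Python, which suggests what is happening. This sketch assumes the VARCHAR column stores Ç as the single byte 0xc7 in a Windows-1252 style code page; that is a guess about the server's collation, not something confirmed in this answer.

```python
raw = b"\xc7"   # assumption: Ç stored as one byte in a Windows-1252 style code page

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason     # keep the reason; 'exc' is unbound after the except block
    print(exc)

print(raw.decode("cp1252"))          # the same byte read as Windows-1252

# After CONVERT(NVARCHAR(max), ...), the value would travel as UTF-16 instead.
utf16 = "Ç".encode("utf-16-le")
print(utf16)
```

In UTF-8, 0xc7 is the lead byte of a two-byte sequence, so a lone 0xc7 is reported as an unexpected end of data.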

2022-09-28 16:04:40
