我收到了一些编码的文本,但我不知道使用的是什么字符集。是否有一种方法可以使用Python确定文本文件的编码?如何检测文本文件的编码/代码页处理c#。


当前回答

在一般情况下,原则上不可能确定文本文件的编码。所以没有标准的Python库来帮你做这个。

如果您对文本文件有更具体的了解(例如,它是XML),可能会有库函数。

其他回答

在一般情况下,原则上不可能确定文本文件的编码。所以没有标准的Python库来帮你做这个。

如果您对文本文件有更具体的了解(例如,它是XML),可能会有库函数。

这可能会有帮助

from bs4 import UnicodeDammit
with open('automate_data/billboard.csv', 'rb') as file:
   content = file.read()

suggestion = UnicodeDammit(content)
suggestion.original_encoding
#'iso-8859-1'

如果你知道文件的一些内容,你可以尝试用几种编码来解码它,看看哪个丢失了。一般来说没有办法,因为文本文件就是文本文件,这些都是愚蠢的;)

很久以前,我有这样的需求。

阅读我的旧代码,我发现了这个:

    import urllib.request
    import chardet
    import os
    import settings

    [...]
    file = 'sources/dl/file.csv'
    media_folder = settings.MEDIA_ROOT
    file = os.path.join(media_folder, str(file))
    if os.path.isfile(file):
        file_2_test = urllib.request.urlopen('file://' + file).read()
        encoding = (chardet.detect(file_2_test))['encoding']
        return encoding

这为我工作,并返回ascii

编辑:chardet似乎无人维护,但大部分答案适用。请登录https://pypi.org/project/charset-normalizer/查看其他选择

始终正确地检测编码是不可能的。

(来自chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

有一个chardet库利用这项研究来检测编码。chardet是Mozilla中自动检测代码的一个端口。

你也可以使用UnicodeDammit。它将尝试以下方法:

An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document. An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII. An encoding sniffed by the chardet library, if you have it installed. UTF-8 Windows-1252