如何确定文本的编码

我收到了一些编码的文本，但我不知道使用的是什么字符集。是否有一种方法可以使用Python确定文本文件的编码?如何检测文本文件的编码/代码页处理c#。

当前回答

一些文本文件知道它们的编码，大多数则不是。意识到:

具有BOM的文本文件 XML文件以UTF-8编码或其编码在序言中给出 JSON文件总是用UTF-8编码

没有意识到:

CSV文件任意文本文件

有些编码是通用的，即它们可以解码任何字节序列，有些则不是。US-ASCII不是万能的，因为任何大于127的字节都不能映射到任何字符。UTF-8不是万能的，因为任何字节序列都是无效的。

相反，Latin-1, Windows-1252等是通用的(即使一些字节没有正式映射到一个字符):

>>> [b.to_bytes(1, 'big').decode("latin-1") for b in range(256)]
['\x00', ..., 'ÿ']

给定一个以字节序列编码的随机文本文件，除非该文件知道其编码，否则无法确定其编码，因为有些编码是通用的。但有时可以排除非通用编码。所有通用编码仍然是可能的。chardet模块使用字节的频率来猜测哪种编码最适合已编码的文本。

如果你不想使用这个模块或类似的模块，这里有一个简单的方法:

检查文件是否知道其编码(BOM) 检查非通用编码并接受第一个可以解码字节的编码(ASCII在UTF-8之前，因为它更严格) 选择一个回退编码。

如果您只检查一个示例，那么第二步有点风险，因为文件其余部分中的某些字节可能是无效的。

代码:

def guess_encoding(data: bytes, fallback: str = "iso8859_15") -> str:
    """
    A basic encoding detector.
    """
    for bom, encoding in [
        (codecs.BOM_UTF32_BE, "utf_32_be"),
        (codecs.BOM_UTF32_LE, "utf_32_le"),
        (codecs.BOM_UTF16_BE, "utf_16_be"),
        (codecs.BOM_UTF16_LE, "utf_16_le"),
        (codecs.BOM_UTF8, "utf_8_sig"),
    ]:
        if data.startswith(bom):
            return encoding

    if all(b < 128 for b in data):
        return "ascii"  # you may want to use the fallback here if data is only a sample.

    decoder = codecs.getincrementaldecoder("utf_8")()
    try:
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return fallback
    else:
        return "utf_8"  # not certain if data is only a sample

记住，非通用编码可能会失败。decode方法的errors参数可以设置为'ignore'， 'replace'或'backslashreplace'以避免异常。

2022-05-27 07:38:53

其他回答

使用linux file -i命令

import subprocess

file = "path/to/file/file.txt"

encoding =  subprocess.Popen("file -bi "+file, shell=True, stdout=subprocess.PIPE).stdout

encoding = re.sub(r"(\\n)[^a-z0-9\-]", "", str(encoding.read()).split("=")[1], flags=re.IGNORECASE)
    
print(encoding)

2021-05-06 14:59:27

一些文本文件知道它们的编码，大多数则不是。意识到:

具有BOM的文本文件 XML文件以UTF-8编码或其编码在序言中给出 JSON文件总是用UTF-8编码

没有意识到:

CSV文件任意文本文件

相反，Latin-1, Windows-1252等是通用的(即使一些字节没有正式映射到一个字符):

>>> [b.to_bytes(1, 'big').decode("latin-1") for b in range(256)]
['\x00', ..., 'ÿ']

如果你不想使用这个模块或类似的模块，这里有一个简单的方法:

检查文件是否知道其编码(BOM) 检查非通用编码并接受第一个可以解码字节的编码(ASCII在UTF-8之前，因为它更严格) 选择一个回退编码。

如果您只检查一个示例，那么第二步有点风险，因为文件其余部分中的某些字节可能是无效的。

代码:

def guess_encoding(data: bytes, fallback: str = "iso8859_15") -> str:
    """
    A basic encoding detector.
    """
    for bom, encoding in [
        (codecs.BOM_UTF32_BE, "utf_32_be"),
        (codecs.BOM_UTF32_LE, "utf_32_le"),
        (codecs.BOM_UTF16_BE, "utf_16_be"),
        (codecs.BOM_UTF16_LE, "utf_16_le"),
        (codecs.BOM_UTF8, "utf_8_sig"),
    ]:
        if data.startswith(bom):
            return encoding

    if all(b < 128 for b in data):
        return "ascii"  # you may want to use the fallback here if data is only a sample.

    decoder = codecs.getincrementaldecoder("utf_8")()
    try:
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return fallback
    else:
        return "utf_8"  # not certain if data is only a sample

记住，非通用编码可能会失败。decode方法的errors参数可以设置为'ignore'， 'replace'或'backslashreplace'以避免异常。

2022-05-27 07:38:53

这个网站有python代码识别ascii，编码bom和utf8 no bom: https://unicodebook.readthedocs.io/guess_encoding.html。将文件读入字节数组(data): http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array。举个例子。我在osx。

#!/usr/bin/python                                                                                                  

import sys

def isUTF8(data):
    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        for ch in decoded:
            if 0xD800 <= ord(ch) <= 0xDFFF:
                return False
        return True

def get_bytes_from_file(filename):
    return open(filename, "rb").read()

filename = sys.argv[1]
data = get_bytes_from_file(filename)
result = isUTF8(data)
print(result)


PS /Users/js> ./isutf8.py hi.txt                                                                                     
True

2019-04-12 14:10:16

编辑:chardet似乎无人维护，但大部分答案适用。请登录https://pypi.org/project/charset-normalizer/查看其他选择

始终正确地检测编码是不可能的。

(来自chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

有一个chardet库利用这项研究来检测编码。chardet是Mozilla中自动检测代码的一个端口。

你也可以使用UnicodeDammit。它将尝试以下方法:

An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document. An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII. An encoding sniffed by the chardet library, if you have it installed. UTF-8 Windows-1252

2009-01-12 17:45:32

如果你不满意自动工具，你可以尝试所有的编解码器，看看哪个编解码器是正确的手动。

all_codecs = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 
'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 
'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 
'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125', 
'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 
'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 
'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 
'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 
'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 
'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 
'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u', 
'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 
'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 
'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 
'utf_8', 'utf_8_sig']

def find_codec(text):
    for i in all_codecs:
        for j in all_codecs:
            try:
                print(i, "to", j, text.encode(i).decode(j))
            except:
                pass

find_codec("The example string which includes ö, ü, or ÄŸ, Ã¶")

这个脚本至少创建了9409行输出。因此，如果输出不能适应终端屏幕，请尝试将输出写入文本文件。

2021-08-08 13:36:36

如何确定文本的编码

推荐文章

最新文章

标签