How to determine the encoding of text

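I received some encoded text, but I don't know which charset was used. Is there a way to determine the encoding of a text file using Python? The related question "How can I detect the encoding/codepage of a text file" deals with C#.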

Current answer

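Here is an example that takes a chardet encoding prediction at face value, reading only the first n_lines of the file in case it is large.

chardet also reports the probability (i.e. the confidence) of its prediction (I haven't looked into how it is computed). It is returned alongside the prediction by chardet.detect(), so you can make use of it if you like.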

import chardet
from pathlib import Path

def predict_encoding(file_path: Path, n_lines: int=20) -> str:
    '''Predict a file's encoding using chardet'''

    # Open the file as binary data
    with Path(file_path).open('rb') as f:
        # Join binary lines for specified number of lines
        rawdata = b''.join([f.readline() for _ in range(n_lines)])

    return chardet.detect(rawdata)['encoding']
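
The confidence value that chardet.detect() returns next to the encoding can be used to reject weak guesses. What follows is a minimal sketch and not part of the original answer: the helper names choose_encoding and predict_encoding_checked, the 0.8 threshold, the 64 KiB sample size, and the utf-8 fallback are all assumptions chosen for illustration.

```python
from pathlib import Path


def choose_encoding(result: dict, min_confidence: float = 0.8,
                    fallback: str = "utf-8") -> str:
    """Return the detected encoding, or `fallback` if the guess is weak.

    `result` is a dict shaped like the return value of chardet.detect(),
    i.e. it has 'encoding' and 'confidence' keys.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # chardet reports encoding=None when it cannot guess at all
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


def predict_encoding_checked(file_path, n_bytes: int = 64 * 1024) -> str:
    """Detect the encoding of a file sample and apply the confidence check."""
    import chardet  # third-party; imported here so choose_encoding works without it

    with Path(file_path).open("rb") as f:
        rawdata = f.read(n_bytes)  # hypothetical sample size, an assumption
    return choose_encoding(chardet.detect(rawdata))
```

Whether a fallback or an explicit error is the better reaction to a low-confidence guess depends on the application; the threshold should be tuned on real data.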

2017-07-18 13:01:49

Other answers

You can use the python-magic package, which does not load the whole file into memory:

import magic


def detect(
    file_path,
):
    return magic.Magic(
        mime_encoding=True,
    ).from_file(file_path)

The output is the encoding name, for example:

iso-8859-1
us-ascii
utf-8

2021-05-27 10:30:32

A long time ago, I had this need.

Reading through my old code, I found this:

    import urllib.request
    import chardet
    import os
    import settings

    [...]
    file = 'sources/dl/file.csv'
    media_folder = settings.MEDIA_ROOT
    file = os.path.join(media_folder, str(file))
    if os.path.isfile(file):
        file_2_test = urllib.request.urlopen('file://' + file).read()
        encoding = (chardet.detect(file_2_test))['encoding']
        return encoding

This worked for me and returned ascii.

2022-10-12 12:47:28

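Depending on your platform, you have options; I simply chose the Linux shell file command. This works for me because I use it in a script that runs exclusively on one of our Linux machines.

Obviously this isn't an ideal solution or answer, but it can be adapted to your needs. In my case I only need to determine whether a file is UTF-8 or not.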

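import subprocess

def is_utf8(path='test.txt'):
    # Run the 'file' command; universal_newlines=True decodes its output to str
    file_cmd = ['file', path]
    p = subprocess.Popen(file_cmd, stdout=subprocess.PIPE, universal_newlines=True)
    cmd_output = p.stdout.readlines()
    p.wait()
    # x will begin with the file type output as is observed using 'file' command
    # (the exact wording can differ between versions of 'file')
    x = cmd_output[0].split(": ", 1)[1]
    return x.startswith('UTF-8')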

2017-06-22 16:39:12

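Some text files are aware of their encoding, most are not.

Aware:

- text files that have a BOM
- XML files, which are encoded in UTF-8 unless their encoding is given in the preamble
- JSON files, which are always encoded in UTF-8

Not aware:

- CSV files
- any arbitrary text file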
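
As an illustration of an "aware" format, the encoding named in an XML preamble can be read from the first bytes of the file. This is a rough sketch under stated assumptions: the function name sniff_xml_encoding and the regular expression are mine, and the approach only works when the declaration itself is stored in an ASCII-compatible encoding (UTF-16 or UTF-32 files have to be recognized by their BOM instead).

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration, e.g.
# <?xml version="1.0" encoding="ISO-8859-1"?>
_XML_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(head: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    match = _XML_DECL.match(head)
    return match.group(1).decode("ascii") if match else None


print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(sniff_xml_encoding(b'<?xml version="1.0"?><a/>'))
```

When the function returns None and the file has no BOM, UTF-8 is the reasonable assumption for XML.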

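Some encodings are versatile, i.e. they can decode any sequence of bytes; others are not. US-ASCII is not versatile, since any byte greater than 127 is not mapped to any character. UTF-8 is not versatile either, since not every sequence of bytes is valid UTF-8.

By contrast, Latin-1, Windows-1252, etc. are versatile (even if some bytes are not officially mapped to a character; Python's cp1252 codec, for one, rejects the few unassigned bytes such as 0x81):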

>>> [b.to_bytes(1, 'big').decode("latin-1") for b in range(256)]
['\x00', ..., 'ÿ']

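Given an arbitrary text file encoded as a sequence of bytes, you cannot determine its encoding unless the file is aware of it, because some encodings are versatile. You can sometimes rule out the non-versatile encodings, but every versatile encoding remains possible. The chardet module uses byte frequencies to guess which encoding best fits the encoded text.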
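
To make the "ruling out" idea concrete, here is a small sketch (the function name possible_encodings and the candidate list are assumptions for illustration). It keeps only the candidates that decode the bytes without error; a versatile encoding such as Latin-1 always survives, which is exactly why the result is never conclusive.

```python
def possible_encodings(data: bytes, candidates):
    """Return the candidate encodings that decode `data` without error."""
    possible = []
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # ruled out: some byte sequence is invalid in this encoding
        possible.append(encoding)
    return possible


candidates = ["ascii", "utf_8", "latin_1"]
text = "caf\u00e9"  # 'café'
print(possible_encodings(text.encode("latin_1"), candidates))  # ['latin_1']
print(possible_encodings(text.encode("utf_8"), candidates))    # ['utf_8', 'latin_1']
print(possible_encodings(b"cafe", candidates))                 # ['ascii', 'utf_8', 'latin_1']
```

Note that the UTF-8 bytes also decode as Latin-1, just to the wrong text, so a successful decode is not proof of the right encoding.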

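If you don't want to use this module or a similar one, here is a simple method:

1. Check whether the file is aware of its encoding (BOM).
2. Check the non-versatile encodings and accept the first one that can decode the bytes (ASCII before UTF-8, because it is stricter).
3. Choose a fallback encoding.

The second step is a bit risky if you only check a sample, because some bytes in the rest of the file may be invalid.

Code: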

import codecs


def guess_encoding(data: bytes, fallback: str = "iso8859_15") -> str:
    """
    A basic encoding detector.
    """
    for bom, encoding in [
        (codecs.BOM_UTF32_BE, "utf_32_be"),
        (codecs.BOM_UTF32_LE, "utf_32_le"),
        (codecs.BOM_UTF16_BE, "utf_16_be"),
        (codecs.BOM_UTF16_LE, "utf_16_le"),
        (codecs.BOM_UTF8, "utf_8_sig"),
    ]:
        if data.startswith(bom):
            return encoding

    if all(b < 128 for b in data):
        return "ascii"  # you may want to use the fallback here if data is only a sample.

    decoder = codecs.getincrementaldecoder("utf_8")()
    try:
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return fallback
    else:
        return "utf_8"  # not certain if data is only a sample

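Remember that decoding with a non-versatile encoding may fail. The errors parameter of the decode method can be set to 'ignore', 'replace' or 'backslashreplace' to avoid exceptions.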
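
A short illustration of the three error handlers, using Latin-1 bytes that are not valid UTF-8 (the sample string is an arbitrary choice):

```python
data = "caf\u00e9".encode("latin_1")  # b'caf\xe9', which is not valid UTF-8

# 'ignore' silently drops the undecodable byte
print(data.decode("utf_8", errors="ignore"))            # caf
# 'replace' substitutes U+FFFD, the Unicode replacement character
print(data.decode("utf_8", errors="replace"))           # 'caf' followed by U+FFFD
# 'backslashreplace' keeps the byte value visible as an escape sequence
print(data.decode("utf_8", errors="backslashreplace"))  # caf\xe9
```

All three avoid the exception but lose or alter information, so they suit display and logging better than round-tripping data.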

2022-05-27 07:38:53
