I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? (How can I detect the encoding/codepage of a text file deals with C#.)


In the general case it is, in principle, impossible to determine the encoding of a text file, so no, there is no standard Python library that does it for you.

If you have more specific knowledge about the text file (e.g. that it is XML), there might be library functions.
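For instance, if the file turns out to be XML, the standard library parser already honours the encoding given in the XML declaration. A minimal sketch (the file name is just a placeholder):

import xml.etree.ElementTree as ET

# ElementTree reads the encoding from the XML declaration,
# e.g. <?xml version="1.0" encoding="iso-8859-1"?>
root = ET.parse('unknown.xml').getroot()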


If you know something about the content of the file, you can try to decode it with several encodings and see which ones fail. In general there is no way, since a text file is just a text file and those are dumb ;)


EDIT: chardet seems to be unmaintained, but most of this answer still applies. Check https://pypi.org/project/charset-normalizer/ for an alternative.
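A minimal charset-normalizer sketch, assuming the package is installed and using a placeholder file name:

from charset_normalizer import from_path

best = from_path('unknown-file.txt').best()  # most plausible match, or None
if best is not None:
    print(best.encoding)  # e.g. 'utf_8'
    text = str(best)      # the decoded text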

Correctly detecting the encoding all of the time is not possible.

(From the chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

There is the chardet library, which uses that kind of study to detect encodings. chardet is a port of the auto-detection code in Mozilla.
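A minimal sketch of the usual chardet call (the file name is just a placeholder):

import chardet

raw = open('unknown-file.txt', 'rb').read()
result = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
text = raw.decode(result['encoding'])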

You can also use UnicodeDammit. It will try the following methods:

- An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
- An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.
- UTF-8
- Windows-1252
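A minimal UnicodeDammit sketch, assuming Beautiful Soup is installed and using a placeholder file name:

from bs4 import UnicodeDammit

raw_bytes = open('unknown-file.txt', 'rb').read()
dammit = UnicodeDammit(raw_bytes)
print(dammit.original_encoding)  # the encoding it settled on
text = dammit.unicode_markup     # the content decoded with that encoding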


Some encoding strategies; uncomment to taste:

#!/bin/bash
#
tmpfile=$1
echo '-- info about the file ........'
file -i "$tmpfile"
enca -g "$tmpfile"
echo 'recoding ........'
#iconv -f iso-8859-2 -t utf-8 back_test.xml > "$tmpfile"
#enca -x utf-8 "$tmpfile"
#enca -g "$tmpfile"
recode CP1250..UTF-8 "$tmpfile"

You might also like to check the encodings by opening and reading the file in a loop... but you may need to check the file size first:

# PYTHON
import codecs

encodings = ['utf-8', 'windows-1250', 'windows-1252']  # add more
for e in encodings:
    try:
        fh = codecs.open('file.txt', 'r', encoding=e)
        fh.readlines()
        fh.seek(0)
    except UnicodeDecodeError:
        print('got unicode error with %s , trying different encoding' % e)
    else:
        print('opening the file with encoding:  %s ' % e)
        break

Another option for working out the encoding is to use libmagic (which is the code behind the file command). There are a profusion of Python bindings available.

The Python bindings that live in the file source tree are available as the python-magic (or python3-magic) Debian package. They can determine the encoding of a file by doing:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc

There is an identically named, but incompatible, python-magic pip package on PyPI that also uses libmagic. It can also obtain the encoding, by doing:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)

# Function: OpenRead(file)

# A text file can be encoded using:
#   (1) The default operating system code page, Or
#   (2) utf8 with a BOM header
#
#  If a text file is encoded with utf8, and does not have a BOM header,
#  the user can manually add a BOM header to the text file
#  using a text editor such as notepad++, and rerun the python script,
#  otherwise the file is read as a codepage file with the 
#  invalid codepage characters removed

import sys
if int(sys.version[0]) != 3:
    print('Aborted: Python 3.x required')
    sys.exit(1)

def bomType(file):
    """
    returns file encoding string for open() function

    EXAMPLE:
        bom = bomType(file)
        open(file, encoding=bom, errors='ignore')
    """

    f = open(file, 'rb')
    b = f.read(4)
    f.close()

    if (b[0:3] == b'\xef\xbb\xbf'):
        return "utf8"

    # Check for a UTF-32 BOM before UTF-16, because the UTF-32 LE BOM
    # starts with the same two bytes as the UTF-16 LE BOM
    if ((b[0:4] == b'\x00\x00\xfe\xff')
              or (b[0:4] == b'\xff\xfe\x00\x00')):
        return "utf32"

    # Python automatically detects endianness if a utf-16 BOM is present
    # (write endianness is generally determined by the endianness of the CPU)
    if ((b[0:2] == b'\xfe\xff') or (b[0:2] == b'\xff\xfe')):
        return "utf16"

    # If no BOM is provided, then assume it's the codepage
    #     used by your operating system
    return "cp1252"
    # For the United States it's: cp1252


def OpenRead(file):
    bom = bomType(file)
    return open(file, 'r', encoding=bom, errors='ignore')


#######################
# Testing it
#######################
fout = open("myfile1.txt", "w", encoding="cp1252")
fout.write("* hi there (cp1252)")
fout.close()

fout = open("myfile2.txt", "w", encoding="utf8")
fout.write("\u2022 hi there (utf8)")
fout.close()

# this case is still treated like codepage cp1252
#   (User responsible for making sure that all utf8 files
#   have a BOM header)
fout = open("badboy.txt", "wb")
fout.write(b"hi there.  barf(\x81\x8D\x90\x9D)")
fout.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile1.txt")
L = fin.readline()
print(L)
fin.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile2.txt")
L = fin.readline()
print(L) #requires QtConsole to view, Cmd.exe is cp1252
fin.close()

# Read CP1252 with a few undefined chars without barfing
fin = OpenRead("badboy.txt")
L = fin.readline()
print(L)
fin.close()

# Check that bad characters are still in badboy codepage file
fin = open("badboy.txt", "rb")
fin.read(20)
fin.close()

Depending on your platform, I just opt to use the Linux shell file command. This works for me since I'm using it in a script that exclusively runs on one of our Linux machines.

Obviously this isn't an ideal solution or answer, but it could be modified to fit your needs. In my case I just need to determine whether a file is UTF-8 or not.

import subprocess

def is_utf8(path):
    file_cmd = ['file', path]
    p = subprocess.Popen(file_cmd, stdout=subprocess.PIPE)
    cmd_output = p.stdout.readlines()
    # the first output line looks like b"test.txt: UTF-8 Unicode text\n"
    x = cmd_output[0].decode().split(": ")[1]
    return x.startswith('UTF-8')

Here is an example of reading and taking at face value a chardet encoding prediction, reading only n_lines from the file in case it is large.

chardet also gives you a probability (i.e. confidence) for its encoding prediction (I haven't looked at how they come up with that), which is returned along with the prediction from chardet.detect(), so you could work that in somehow if you like.

import chardet
from pathlib import Path

def predict_encoding(file_path: Path, n_lines: int=20) -> str:
    '''Predict a file's encoding using chardet'''

    # Open the file as binary data
    with Path(file_path).open('rb') as f:
        # Join binary lines for specified number of lines
        rawdata = b''.join([f.readline() for _ in range(n_lines)])

    return chardet.detect(rawdata)['encoding']
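If you also want the confidence value mentioned above, a small variation on the same function (reusing the imports above; the function name is just illustrative) could return both fields of chardet.detect():

def predict_encoding_with_confidence(file_path: Path, n_lines: int = 20):
    '''Return chardet's (encoding, confidence) guess for a file.'''
    with Path(file_path).open('rb') as f:
        rawdata = b''.join([f.readline() for _ in range(n_lines)])
    result = chardet.detect(rawdata)
    return result['encoding'], result['confidence']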

This site has Python code for recognizing ASCII, encodings with BOMs, and UTF-8 without a BOM: https://unicodebook.readthedocs.io/guess_encoding.html. Read the file into a byte array (data): http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array. Here is an example (I'm on OS X).

#!/usr/bin/python                                                                                                  

import sys

def isUTF8(data):
    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        for ch in decoded:
            if 0xD800 <= ord(ch) <= 0xDFFF:
                return False
        return True

def get_bytes_from_file(filename):
    return open(filename, "rb").read()

filename = sys.argv[1]
data = get_bytes_from_file(filename)
result = isUTF8(data)
print(result)


PS /Users/js> ./isutf8.py hi.txt                                                                                     
True

This might be helpful:

from bs4 import UnicodeDammit
with open('automate_data/billboard.csv', 'rb') as file:
    content = file.read()

suggestion = UnicodeDammit(content)
suggestion.original_encoding
#'iso-8859-1'

Using the Linux file -i command:

import subprocess
import re

file = "path/to/file/file.txt"

encoding = subprocess.Popen("file -bi " + file, shell=True, stdout=subprocess.PIPE).stdout

encoding = re.sub(r"(\\n)[^a-z0-9\-]", "", str(encoding.read()).split("=")[1], flags=re.IGNORECASE)
print(encoding)

You can use the python-magic package, which does not load the whole file into memory:

import magic


def detect(
    file_path,
):
    return magic.Magic(
        mime_encoding=True,
    ).from_file(file_path)

The output is the encoding name, for example:

iso-8859-1
us-ascii
utf-8


You can use the chardet module:

import chardet

with open(filepath, "rb") as f:
    detector = chardet.UniversalDetector()
    # feed the file line by line until the detector is confident enough
    for line in f:
        detector.feed(line)
        if detector.done:
            break
    detector.close()
    print(detector.result)

Or you can use the chardet3 command in Linux, but it takes some time:

chardet3 fileName

Example:

chardet3 donnee/dir/donnee.csv
donnee/dir/donnee.csv: ISO-8859-1 with confidence 0.73


If you are not satisfied with the automatic tools, you can try all codecs and check which one is correct manually.

all_codecs = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 
'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 
'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 
'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125', 
'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 
'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 
'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 
'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 
'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 
'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 
'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u', 
'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 
'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 
'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 
'utf_8', 'utf_8_sig']

def find_codec(text):
    for i in all_codecs:
        for j in all_codecs:
            try:
                print(i, "to", j, text.encode(i).decode(j))
            except:
                pass

find_codec("The example string which includes ö, ü, or ÄŸ, ö")

This script creates at least 9409 lines of output, so if the output cannot fit on the terminal screen, try writing the output to a text file.


Some text files are aware of their encoding, most are not. Aware:

- a text file with a BOM
- an XML file encoded in UTF-8, or whose encoding is given in the prolog
- a JSON file, which is always encoded in UTF-8

Not aware:

- a CSV file
- any random text file

Some encodings are universal, i.e. they can decode any sequence of bytes; some are not. US-ASCII is not universal, since any byte greater than 127 is not mapped to any character. UTF-8 is not universal, since not every sequence of bytes is valid.

On the contrary, Latin-1, Windows-1252, etc. are universal (even if some bytes are not officially mapped to a character):

>>> [b.to_bytes(1, 'big').decode("latin-1") for b in range(256)]
['\x00', ..., 'ÿ']

Given a random text file encoded as a sequence of bytes, you cannot determine its encoding unless the file is aware of its encoding, because some encodings are universal. But non-universal encodings can sometimes be excluded, while all universal encodings remain possible. The chardet module uses byte frequencies to guess which encoding fits the encoded text best.

If you don't want to use this module or a similar one, here is a simple method:

- check if the file is aware of its encoding (BOM)
- check the non-universal encodings and accept the first one that can decode the bytes (ASCII before UTF-8, because it is stricter)
- choose a fallback encoding.

The second step is a bit risky if you only check a sample, because some bytes in the rest of the file may be invalid.

The code:

import codecs

def guess_encoding(data: bytes, fallback: str = "iso8859_15") -> str:
    """
    A basic encoding detector.
    """
    for bom, encoding in [
        (codecs.BOM_UTF32_BE, "utf_32_be"),
        (codecs.BOM_UTF32_LE, "utf_32_le"),
        (codecs.BOM_UTF16_BE, "utf_16_be"),
        (codecs.BOM_UTF16_LE, "utf_16_le"),
        (codecs.BOM_UTF8, "utf_8_sig"),
    ]:
        if data.startswith(bom):
            return encoding

    if all(b < 128 for b in data):
        return "ascii"  # you may want to use the fallback here if data is only a sample.

    decoder = codecs.getincrementaldecoder("utf_8")()
    try:
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return fallback
    else:
        return "utf_8"  # not certain if data is only a sample

Remember that non-universal encodings may fail. The errors parameter of the decode method can be set to 'ignore', 'replace' or 'backslashreplace' to avoid exceptions.
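For example, a usage sketch combining the guess_encoding() function above with a tolerant decode (the file name is just a placeholder):

from pathlib import Path

data = Path('unknown-file.txt').read_bytes()
encoding = guess_encoding(data)
# 'replace' substitutes undecodable bytes instead of raising UnicodeDecodeError
text = data.decode(encoding, errors='replace')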


A long time ago, I had this same need.

Reading my old code, I found this:

    import urllib.request
    import chardet
    import os
    import settings

    [...]
    file = 'sources/dl/file.csv'
    media_folder = settings.MEDIA_ROOT
    file = os.path.join(media_folder, str(file))
    if os.path.isfile(file):
        file_2_test = urllib.request.urlopen('file://' + file).read()
        encoding = (chardet.detect(file_2_test))['encoding']
        return encoding

This worked for me and returned ascii.