我收到了一些编码的文本,但我不知道使用的是什么字符集。是否有一种方法可以使用Python确定文本文件的编码?如何检测文本文件的编码/代码页处理c#。


当前回答

这个网站有python代码识别ascii,编码bom和utf8 no bom: https://unicodebook.readthedocs.io/guess_encoding.html。将文件读入字节数组(data): http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array。举个例子。我在osx。

#!/usr/bin/python                                                                                                  

import sys

def isUTF8(data):
    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        for ch in decoded:
            if 0xD800 <= ord(ch) <= 0xDFFF:
                return False
        return True

def get_bytes_from_file(filename):
    return open(filename, "rb").read()

filename = sys.argv[1]
data = get_bytes_from_file(filename)
result = isUTF8(data)
print(result)


PS /Users/js> ./isutf8.py hi.txt                                                                                     
True

其他回答

编辑:chardet似乎无人维护,但大部分答案适用。请登录https://pypi.org/project/charset-normalizer/查看其他选择

始终正确地检测编码是不可能的。

(来自chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

有一个chardet库利用这项研究来检测编码。chardet是Mozilla中自动检测代码的一个端口。

你也可以使用UnicodeDammit。它将尝试以下方法:

An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document. An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII. An encoding sniffed by the chardet library, if you have it installed. UTF-8 Windows-1252

你可以使用' python-magic package,它不会将整个文件加载到内存中:

import magic


def detect(
    file_path,
):
    return magic.Magic(
        mime_encoding=True,
    ).from_file(file_path)

输出是编码名称,例如:

iso - 8859 - 1 us - ascii utf - 8

下面是一个读取并接受一个chardet编码预测的例子,如果它很大,则从文件中读取n_lines。

Chardet还提供了它的编码预测的概率(即置信度)(还没有看到他们是如何提出的),它与Chardet .predict()的预测一起返回,所以如果你喜欢,你可以以某种方式使用它。

import chardet
from pathlib import Path

def predict_encoding(file_path: Path, n_lines: int=20) -> str:
    '''Predict a file's encoding using chardet'''

    # Open the file as binary data
    with Path(file_path).open('rb') as f:
        # Join binary lines for specified number of lines
        rawdata = b''.join([f.readline() for _ in range(n_lines)])

    return chardet.detect(rawdata)['encoding']

一些编码策略,请取消评论品味:

#!/bin/bash
#
tmpfile=$1
echo '-- info about file file ........'
file -i $tmpfile
enca -g $tmpfile
echo 'recoding ........'
#iconv -f iso-8859-2 -t utf-8 back_test.xml > $tmpfile
#enca -x utf-8 $tmpfile
#enca -g $tmpfile
recode CP1250..UTF-8 $tmpfile

您可能希望通过以循环的形式打开并读取文件来检查编码…但是你可能需要先检查文件大小:

# PYTHON
encodings = ['utf-8', 'windows-1250', 'windows-1252'] # add more
for e in encodings:
    try:
        fh = codecs.open('file.txt', 'r', encoding=e)
        fh.readlines()
        fh.seek(0)
    except UnicodeDecodeError:
        print('got unicode error with %s , trying different encoding' % e)
    else:
        print('opening the file with encoding:  %s ' % e)
        break

如果你不满意自动工具,你可以尝试所有的编解码器,看看哪个编解码器是正确的手动。

all_codecs = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 
'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 
'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 
'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125', 
'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 
'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 
'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 
'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 
'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 
'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 
'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u', 
'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 
'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 
'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 
'utf_8', 'utf_8_sig']

def find_codec(text):
    for i in all_codecs:
        for j in all_codecs:
            try:
                print(i, "to", j, text.encode(i).decode(j))
            except:
                pass

find_codec("The example string which includes ö, ü, or ÄŸ, ö")

这个脚本至少创建了9409行输出。因此,如果输出不能适应终端屏幕,请尝试将输出写入文本文件。