我收到了一些编码的文本,但我不知道使用的是什么字符集。是否有一种方法可以使用Python确定文本文件的编码?如何检测文本文件的编码/代码页处理c#。


当前回答

如果你不满意自动工具,你可以尝试所有的编解码器,看看哪个编解码器是正确的手动。

all_codecs = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 
'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 
'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 
'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125', 
'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 
'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 
'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 
'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 
'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 
'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 
'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u', 
'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 
'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 
'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 
'utf_8', 'utf_8_sig']

def find_codec(text):
    for i in all_codecs:
        for j in all_codecs:
            try:
                print(i, "to", j, text.encode(i).decode(j))
            except:
                pass

find_codec("The example string which includes ö, ü, or ÄŸ, ö")

这个脚本至少创建了9409行输出。因此,如果输出不能适应终端屏幕,请尝试将输出写入文本文件。

其他回答

# Function: OpenRead(file)

# A text file can be encoded using:
#   (1) The default operating system code page, Or
#   (2) utf8 with a BOM header
#
#  If a text file is encoded with utf8, and does not have a BOM header,
#  the user can manually add a BOM header to the text file
#  using a text editor such as notepad++, and rerun the python script,
#  otherwise the file is read as a codepage file with the 
#  invalid codepage characters removed

import sys
if int(sys.version[0]) != 3:
    print('Aborted: Python 3.x required')
    sys.exit(1)

def bomType(file):
    """
    returns file encoding string for open() function

    EXAMPLE:
        bom = bomtype(file)
        open(file, encoding=bom, errors='ignore')
    """

    f = open(file, 'rb')
    b = f.read(4)
    f.close()

    if (b[0:3] == b'\xef\xbb\xbf'):
        return "utf8"

    # Python automatically detects endianess if utf-16 bom is present
    # write endianess generally determined by endianess of CPU
    if ((b[0:2] == b'\xfe\xff') or (b[0:2] == b'\xff\xfe')):
        return "utf16"

    if ((b[0:5] == b'\xfe\xff\x00\x00') 
              or (b[0:5] == b'\x00\x00\xff\xfe')):
        return "utf32"

    # If BOM is not provided, then assume its the codepage
    #     used by your operating system
    return "cp1252"
    # For the United States its: cp1252


def OpenRead(file):
    bom = bomType(file)
    return open(file, 'r', encoding=bom, errors='ignore')


#######################
# Testing it
#######################
fout = open("myfile1.txt", "w", encoding="cp1252")
fout.write("* hi there (cp1252)")
fout.close()

fout = open("myfile2.txt", "w", encoding="utf8")
fout.write("\u2022 hi there (utf8)")
fout.close()

# this case is still treated like codepage cp1252
#   (User responsible for making sure that all utf8 files
#   have a BOM header)
fout = open("badboy.txt", "wb")
fout.write(b"hi there.  barf(\x81\x8D\x90\x9D)")
fout.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile1.txt")
L = fin.readline()
print(L)
fin.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile2.txt")
L =fin.readline() 
print(L) #requires QtConsole to view, Cmd.exe is cp1252
fin.close()

# Read CP1252 with a few undefined chars without barfing
fin = OpenRead("badboy.txt")
L =fin.readline() 
print(L)
fin.close()

# Check that bad characters are still in badboy codepage file
fin = open("badboy.txt", "rb")
fin.read(20)
fin.close()

在一般情况下,原则上不可能确定文本文件的编码。所以没有标准的Python库来帮你做这个。

如果您对文本文件有更具体的了解(例如,它是XML),可能会有库函数。

下面是一个读取并接受一个chardet编码预测的例子,如果它很大,则从文件中读取n_lines。

Chardet还提供了它的编码预测的概率(即置信度)(还没有看到他们是如何提出的),它与Chardet .predict()的预测一起返回,所以如果你喜欢,你可以以某种方式使用它。

import chardet
from pathlib import Path

def predict_encoding(file_path: Path, n_lines: int=20) -> str:
    '''Predict a file's encoding using chardet'''

    # Open the file as binary data
    with Path(file_path).open('rb') as f:
        # Join binary lines for specified number of lines
        rawdata = b''.join([f.readline() for _ in range(n_lines)])

    return chardet.detect(rawdata)['encoding']

这可能会有帮助

from bs4 import UnicodeDammit
with open('automate_data/billboard.csv', 'rb') as file:
   content = file.read()

suggestion = UnicodeDammit(content)
suggestion.original_encoding
#'iso-8859-1'

你可以使用chardet模块

import chardet

with open (filepath , "rb") as f:
    data= f.read()
    encode=chardet.UniversalDetector()
    encode.close()
    print(encode.result)

或者你可以在linux中使用chardet3命令,但这需要一些时间:

chardet3 fileName

例子:

chardet3 donnee/dir/donnee.csv
donnee/dir/donnee.csv: ISO-8859-1 with confidence 0.73