我收到了一些编码的文本,但我不知道使用的是什么字符集。是否有一种方法可以使用Python确定文本文件的编码?如何检测文本文件的编码/代码页处理c#。


当前回答

你可以使用' python-magic package,它不会将整个文件加载到内存中:

import magic


def detect(
    file_path,
):
    return magic.Magic(
        mime_encoding=True,
    ).from_file(file_path)

输出是编码名称,例如:

iso - 8859 - 1 us - ascii utf - 8

其他回答

一些编码策略,请取消评论品味:

#!/bin/bash
#
tmpfile=$1
echo '-- info about file file ........'
file -i $tmpfile
enca -g $tmpfile
echo 'recoding ........'
#iconv -f iso-8859-2 -t utf-8 back_test.xml > $tmpfile
#enca -x utf-8 $tmpfile
#enca -g $tmpfile
recode CP1250..UTF-8 $tmpfile

您可能希望通过以循环的形式打开并读取文件来检查编码…但是你可能需要先检查文件大小:

# PYTHON
encodings = ['utf-8', 'windows-1250', 'windows-1252'] # add more
for e in encodings:
    try:
        fh = codecs.open('file.txt', 'r', encoding=e)
        fh.readlines()
        fh.seek(0)
    except UnicodeDecodeError:
        print('got unicode error with %s , trying different encoding' % e)
    else:
        print('opening the file with encoding:  %s ' % e)
        break
# Function: OpenRead(file)

# A text file can be encoded using:
#   (1) The default operating system code page, Or
#   (2) utf8 with a BOM header
#
#  If a text file is encoded with utf8, and does not have a BOM header,
#  the user can manually add a BOM header to the text file
#  using a text editor such as notepad++, and rerun the python script,
#  otherwise the file is read as a codepage file with the 
#  invalid codepage characters removed

import sys
if int(sys.version[0]) != 3:
    print('Aborted: Python 3.x required')
    sys.exit(1)

def bomType(file):
    """
    returns file encoding string for open() function

    EXAMPLE:
        bom = bomtype(file)
        open(file, encoding=bom, errors='ignore')
    """

    f = open(file, 'rb')
    b = f.read(4)
    f.close()

    if (b[0:3] == b'\xef\xbb\xbf'):
        return "utf8"

    # Python automatically detects endianess if utf-16 bom is present
    # write endianess generally determined by endianess of CPU
    if ((b[0:2] == b'\xfe\xff') or (b[0:2] == b'\xff\xfe')):
        return "utf16"

    if ((b[0:5] == b'\xfe\xff\x00\x00') 
              or (b[0:5] == b'\x00\x00\xff\xfe')):
        return "utf32"

    # If BOM is not provided, then assume its the codepage
    #     used by your operating system
    return "cp1252"
    # For the United States its: cp1252


def OpenRead(file):
    bom = bomType(file)
    return open(file, 'r', encoding=bom, errors='ignore')


#######################
# Testing it
#######################
fout = open("myfile1.txt", "w", encoding="cp1252")
fout.write("* hi there (cp1252)")
fout.close()

fout = open("myfile2.txt", "w", encoding="utf8")
fout.write("\u2022 hi there (utf8)")
fout.close()

# this case is still treated like codepage cp1252
#   (User responsible for making sure that all utf8 files
#   have a BOM header)
fout = open("badboy.txt", "wb")
fout.write(b"hi there.  barf(\x81\x8D\x90\x9D)")
fout.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile1.txt")
L = fin.readline()
print(L)
fin.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile2.txt")
L =fin.readline() 
print(L) #requires QtConsole to view, Cmd.exe is cp1252
fin.close()

# Read CP1252 with a few undefined chars without barfing
fin = OpenRead("badboy.txt")
L =fin.readline() 
print(L)
fin.close()

# Check that bad characters are still in badboy codepage file
fin = open("badboy.txt", "rb")
fin.read(20)
fin.close()

编辑:chardet似乎无人维护,但大部分答案适用。请登录https://pypi.org/project/charset-normalizer/查看其他选择

始终正确地检测编码是不可能的。

(来自chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

有一个chardet库利用这项研究来检测编码。chardet是Mozilla中自动检测代码的一个端口。

你也可以使用UnicodeDammit。它将尝试以下方法:

An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document. An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII. An encoding sniffed by the chardet library, if you have it installed. UTF-8 Windows-1252

使用linux file -i命令

import subprocess

file = "path/to/file/file.txt"

encoding =  subprocess.Popen("file -bi "+file, shell=True, stdout=subprocess.PIPE).stdout

encoding = re.sub(r"(\\n)[^a-z0-9\-]", "", str(encoding.read()).split("=")[1], flags=re.IGNORECASE)
    
print(encoding)

这个网站有python代码识别ascii,编码bom和utf8 no bom: https://unicodebook.readthedocs.io/guess_encoding.html。将文件读入字节数组(data): http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array。举个例子。我在osx。

#!/usr/bin/python                                                                                                  

import sys

def isUTF8(data):
    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        for ch in decoded:
            if 0xD800 <= ord(ch) <= 0xDFFF:
                return False
        return True

def get_bytes_from_file(filename):
    return open(filename, "rb").read()

filename = sys.argv[1]
data = get_bytes_from_file(filename)
result = isUTF8(data)
print(result)


PS /Users/js> ./isutf8.py hi.txt                                                                                     
True