我收到了一些编码的文本,但我不知道使用的是什么字符集。是否有一种方法可以使用Python确定文本文件的编码?如何检测文本文件的编码/代码页处理c#。
当前回答
如果你知道文件的一些内容,你可以尝试用几种编码来解码它,看看哪个丢失了。一般来说没有办法,因为文本文件就是文本文件,这些都是愚蠢的;)
其他回答
如果你不满意自动工具,你可以尝试所有的编解码器,看看哪个编解码器是正确的手动。
all_codecs = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437',
'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857',
'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869',
'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125',
'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256',
'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr',
'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2',
'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1',
'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7',
'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13',
'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u',
'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman',
'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213',
'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7',
'utf_8', 'utf_8_sig']
def find_codec(text):
for i in all_codecs:
for j in all_codecs:
try:
print(i, "to", j, text.encode(i).decode(j))
except:
pass
find_codec("The example string which includes ö, ü, or ÄŸ, ö")
这个脚本至少创建了9409行输出。因此,如果输出不能适应终端屏幕,请尝试将输出写入文本文件。
# Function: OpenRead(file)
# A text file can be encoded using:
# (1) The default operating system code page, Or
# (2) utf8 with a BOM header
#
# If a text file is encoded with utf8, and does not have a BOM header,
# the user can manually add a BOM header to the text file
# using a text editor such as notepad++, and rerun the python script,
# otherwise the file is read as a codepage file with the
# invalid codepage characters removed
import sys
if int(sys.version[0]) != 3:
print('Aborted: Python 3.x required')
sys.exit(1)
def bomType(file):
"""
returns file encoding string for open() function
EXAMPLE:
bom = bomtype(file)
open(file, encoding=bom, errors='ignore')
"""
f = open(file, 'rb')
b = f.read(4)
f.close()
if (b[0:3] == b'\xef\xbb\xbf'):
return "utf8"
# Python automatically detects endianess if utf-16 bom is present
# write endianess generally determined by endianess of CPU
if ((b[0:2] == b'\xfe\xff') or (b[0:2] == b'\xff\xfe')):
return "utf16"
if ((b[0:5] == b'\xfe\xff\x00\x00')
or (b[0:5] == b'\x00\x00\xff\xfe')):
return "utf32"
# If BOM is not provided, then assume its the codepage
# used by your operating system
return "cp1252"
# For the United States its: cp1252
def OpenRead(file):
bom = bomType(file)
return open(file, 'r', encoding=bom, errors='ignore')
#######################
# Testing it
#######################
fout = open("myfile1.txt", "w", encoding="cp1252")
fout.write("* hi there (cp1252)")
fout.close()
fout = open("myfile2.txt", "w", encoding="utf8")
fout.write("\u2022 hi there (utf8)")
fout.close()
# this case is still treated like codepage cp1252
# (User responsible for making sure that all utf8 files
# have a BOM header)
fout = open("badboy.txt", "wb")
fout.write(b"hi there. barf(\x81\x8D\x90\x9D)")
fout.close()
# Read Example file with Bom Detection
fin = OpenRead("myfile1.txt")
L = fin.readline()
print(L)
fin.close()
# Read Example file with Bom Detection
fin = OpenRead("myfile2.txt")
L =fin.readline()
print(L) #requires QtConsole to view, Cmd.exe is cp1252
fin.close()
# Read CP1252 with a few undefined chars without barfing
fin = OpenRead("badboy.txt")
L =fin.readline()
print(L)
fin.close()
# Check that bad characters are still in badboy codepage file
fin = open("badboy.txt", "rb")
fin.read(20)
fin.close()
一些编码策略,请取消评论品味:
#!/bin/bash
#
tmpfile=$1
echo '-- info about file file ........'
file -i $tmpfile
enca -g $tmpfile
echo 'recoding ........'
#iconv -f iso-8859-2 -t utf-8 back_test.xml > $tmpfile
#enca -x utf-8 $tmpfile
#enca -g $tmpfile
recode CP1250..UTF-8 $tmpfile
您可能希望通过以循环的形式打开并读取文件来检查编码…但是你可能需要先检查文件大小:
# PYTHON
encodings = ['utf-8', 'windows-1250', 'windows-1252'] # add more
for e in encodings:
try:
fh = codecs.open('file.txt', 'r', encoding=e)
fh.readlines()
fh.seek(0)
except UnicodeDecodeError:
print('got unicode error with %s , trying different encoding' % e)
else:
print('opening the file with encoding: %s ' % e)
break
另一种计算编码的方法是使用 的代码 文件命令)。有大量的 可用的Python绑定。
文件源树中的python绑定可作为 Python-magic(或python3-magic) debian软件包。它可以通过执行以下操作来确定文件的编码:
import magic
blob = open('unknown-file', 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob) # "utf-8" "us-ascii" etc
在pypi上有一个名称相同但不兼容的python-magic pip包,它也使用libmagic。它也可以得到编码,通过这样做:
import magic
blob = open('unknown-file', 'rb').read()
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)
这个网站有python代码识别ascii,编码bom和utf8 no bom: https://unicodebook.readthedocs.io/guess_encoding.html。将文件读入字节数组(data): http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array。举个例子。我在osx。
#!/usr/bin/python
import sys
def isUTF8(data):
try:
decoded = data.decode('UTF-8')
except UnicodeDecodeError:
return False
else:
for ch in decoded:
if 0xD800 <= ord(ch) <= 0xDFFF:
return False
return True
def get_bytes_from_file(filename):
return open(filename, "rb").read()
filename = sys.argv[1]
data = get_bytes_from_file(filename)
result = isUTF8(data)
print(result)
PS /Users/js> ./isutf8.py hi.txt
True
推荐文章
- 将一个列表分成大约相等长度的N个部分
- Python __str__与__unicode__
- 在python中,del和delattr哪个更好?
- 如何动态加载Python类
- 有没有办法在python中做HTTP PUT
- “foo Is None”和“foo == None”之间有什么区别吗?
- 类没有对象成员
- Django模型“没有显式声明app_label”
- 熊猫能自动从CSV文件中读取日期吗?
- 在python中zip的逆函数是什么?
- 有效的方法应用多个过滤器的熊猫数据框架或系列
- 如何检索插入id后插入行在SQLite使用Python?
- 我如何在Django中添加一个CharField占位符?
- 如何在Python中获取当前执行文件的路径?
- 我如何得到“id”后插入到MySQL数据库与Python?