我收到了一些编码的文本,但我不知道使用的是什么字符集。是否有一种方法可以使用Python确定文本文件的编码?如何检测文本文件的编码/代码页处理c#。
当前回答
在一般情况下,原则上不可能确定文本文件的编码。所以没有标准的Python库来帮你做这个。
如果您对文本文件有更具体的了解(例如,它是XML),可能会有库函数。
其他回答
使用linux file -i命令
import subprocess
file = "path/to/file/file.txt"
encoding = subprocess.Popen("file -bi "+file, shell=True, stdout=subprocess.PIPE).stdout
encoding = re.sub(r"(\\n)[^a-z0-9\-]", "", str(encoding.read()).split("=")[1], flags=re.IGNORECASE)
print(encoding)
在一般情况下,原则上不可能确定文本文件的编码。所以没有标准的Python库来帮你做这个。
如果您对文本文件有更具体的了解(例如,它是XML),可能会有库函数。
如果你知道文件的一些内容,你可以尝试用几种编码来解码它,看看哪个丢失了。一般来说没有办法,因为文本文件就是文本文件,这些都是愚蠢的;)
编辑:chardet似乎无人维护,但大部分答案适用。请登录https://pypi.org/project/charset-normalizer/查看其他选择
始终正确地检测编码是不可能的。
(来自chardet FAQ:)
However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
有一个chardet库利用这项研究来检测编码。chardet是Mozilla中自动检测代码的一个端口。
你也可以使用UnicodeDammit。它将尝试以下方法:
An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document. An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII. An encoding sniffed by the chardet library, if you have it installed. UTF-8 Windows-1252
这个网站有python代码识别ascii,编码bom和utf8 no bom: https://unicodebook.readthedocs.io/guess_encoding.html。将文件读入字节数组(data): http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array。举个例子。我在osx。
#!/usr/bin/python
import sys
def isUTF8(data):
try:
decoded = data.decode('UTF-8')
except UnicodeDecodeError:
return False
else:
for ch in decoded:
if 0xD800 <= ord(ch) <= 0xDFFF:
return False
return True
def get_bytes_from_file(filename):
return open(filename, "rb").read()
filename = sys.argv[1]
data = get_bytes_from_file(filename)
result = isUTF8(data)
print(result)
PS /Users/js> ./isutf8.py hi.txt
True
推荐文章
- 证书验证失败:无法获得本地颁发者证书
- 当使用pip3安装包时,“Python中的ssl模块不可用”
- 无法切换Python与pyenv
- Python if not == vs if !=
- 如何从scikit-learn决策树中提取决策规则?
- 为什么在Mac OS X v10.9 (Mavericks)的终端中apt-get功能不起作用?
- 将旋转的xtick标签与各自的xtick对齐
- 为什么元组可以包含可变项?
- 如何合并字典的字典?
- 如何创建类属性?
- 不区分大小写的“in”
- 在Python中获取迭代器中的元素个数
- 解析日期字符串并更改格式
- 使用try和。Python中的if
- 如何在Python中获得所有直接子目录