Convert bytes to a string

I captured the standard output of an external program into a bytes object:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>>
>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

I want to convert it to a normal Python string, so that I can print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

How do I convert a bytes object to a str with Python 3?

Current answer

To interpret a byte sequence as text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

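The ls command may produce output that cannot be interpreted as text. File names on Unix may be any sequence of bytes except the slash b'/' and the zero byte b'\0':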

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

Trying to decode such byte soup using the utf-8 encoding raises UnicodeDecodeError.

It can be worse. The decoding may fail silently and produce mojibake if you use a wrong, incompatible encoding:

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

The data is corrupted, but your program remains unaware that a failure has occurred.

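In general, the character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. Some outcomes are more likely than others, which is why the chardet module exists: it can guess the character encoding. A single Python script may use multiple character encodings in different places.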

ls output can be converted to a Python string using the os.fsdecode() function, which succeeds even for undecodable filenames (it uses sys.getfilesystemencoding() and the surrogateescape error handler on Unix):

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes back, you can use os.fsencode().
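As a rough sketch of that round trip: the file name below is made up for illustration, and the behaviour described in the comments assumes a Unix system with a UTF-8 filesystem encoding.

```python
import os

# Hypothetical file name: Latin-1 encoded bytes that are not valid UTF-8.
raw_name = b'caf\xe9.txt'

# On Unix, os.fsdecode() does not raise for this input: the undecodable
# byte is carried inside the str by the surrogateescape error handler.
name = os.fsdecode(raw_name)

# os.fsencode() reverses the mapping and gives back the original bytes.
restored = os.fsencode(name)
print(restored == raw_name)
```

The decoded name may contain lone surrogates, so it is meant to be passed back to os functions rather than printed or written to a UTF-8 file as is.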

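If you pass the universal_newlines=True parameter (also spelled text=True since Python 3.7), subprocess uses locale.getpreferredencoding(False) to decode the bytes; for example, it can be cp1252 on Windows.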
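A minimal sketch of this option follows. To stay independent of ls or dir, it runs a child Python interpreter through sys.executable; that command is an assumption chosen for portability and is not part of the original question.

```python
import subprocess
import sys

# Run a child Python process that prints plain ASCII text.
output = subprocess.check_output(
    [sys.executable, '-c', 'print("hello")'],
    universal_newlines=True,  # decode with the locale's preferred encoding
)

# The result is already a str, so no explicit .decode() call is needed.
print(type(output).__name__)
```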

To decode a byte stream on the fly, io.TextIOWrapper() can be used.
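A small sketch of on-the-fly decoding, using an in-memory io.BytesIO as a stand-in for a real binary stream such as a subprocess pipe (the sample data is invented for illustration):

```python
import io

# In-memory stand-in for a binary stream such as a pipe's stdout.
byte_stream = io.BytesIO('µ one\nµ two\n'.encode('utf-8'))

# TextIOWrapper decodes incrementally as lines are requested, so the
# whole input never has to be read into memory first.
text_stream = io.TextIOWrapper(byte_stream, encoding='utf-8')
lines = [line.rstrip('\n') for line in text_stream]
```

The encoding still has to be supplied explicitly; the wrapper only changes when the decoding happens, not how the encoding is determined.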

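Different commands may use different character encodings for their output; for example, the dir internal command (cmd) may use cp437. To decode its output, you can pass the encoding explicitly (Python 3.6+):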

output = subprocess.check_output('dir', shell=True, encoding='cp437')

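The filenames may differ from those returned by os.listdir() (which uses the Windows Unicode API). For example, '\xb6' can be substituted with '\x14': Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see "Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string".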

2016-11-16 09:43:26

Other answers

Bytes

m=b'This is bytes'

Convert to a string

Method 1

m.decode("utf-8")

m.decode()

Method 2

import codecs
codecs.decode(m,encoding="utf-8")

import codecs
codecs.decode(m)

Method 3

str(m,encoding="utf-8")

str(m)[2:-1]  # slices the b'...' repr; only reliable for plain ASCII content

Result

'This is bytes'

2022-06-21 13:18:28

I think you actually want this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

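Aaron's answer was correct, except that you need to know which encoding to use. I believe that Windows uses 'windows-1252'. It only matters if your content contains some unusual (non-ASCII) characters, but then it does make a difference.

By the way, the fact that it does matter is the reason Python moved to using two different types for binary and text data: it cannot convert magically between them, because it does not know the encoding unless you tell it! The only way you would know is to read the Windows documentation (or read it here).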

2011-07-18 19:51:15

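def toString(string):
    # Decode bytes as UTF-8; a str (which has no .decode in Python 3)
    # or undecodable bytes are returned unchanged.
    try:
        return string.decode("utf-8")
    except (AttributeError, ValueError):
        return string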

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))

2018-06-03 22:44:45

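If you don't know the encoding, then to read binary input into a string in a way that is compatible with both Python 3 and Python 2, use the ancient MS-DOS CP437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)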

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

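Because the encoding is unknown, expect non-English symbols to be translated to cp437 characters (English characters are not affected, because they match in most single-byte encodings and in UTF-8).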
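The following sketch illustrates why cp437 never fails, and also what it costs. It is Python 3 only, and the sample character 'é' is an arbitrary choice for illustration.

```python
# Python's cp437 codec defines a character for every one of the 256 byte
# values, so decoding never raises and re-encoding restores the input.
raw = bytes(range(256))
text = raw.decode('cp437')
round_trip = text.encode('cp437')

# The price: non-ASCII input is mis-rendered. The two UTF-8 bytes of 'é'
# come back as two unrelated cp437 characters.
garbled = 'é'.encode('utf-8').decode('cp437')

print(round_trip == raw, len(garbled))
```

In other words, the decoded text is a lossless placeholder for the bytes, not a faithful rendering of the original characters.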

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this:

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte

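The same applies to latin-1, which was popular (the default?) for Python 2. See the missing points in Codepage Layout; that is where Python chokes with the infamous "ordinal not in range" error.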

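UPDATE 20150604: There are rumors that Python 3 has the surrogateescape error strategy for encoding stuff into binary data without data loss or crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.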
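A minimal sketch of such a [binary] -> [str] -> [binary] check with the surrogateescape error handler is shown below. It covers only round-trip correctness on the inputs shown, not performance, and it is Python 3 only.

```python
# The same bytes that fail strict UTF-8 decoding earlier in this answer.
raw = b'\x00\x01\xffsd'

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) in the str.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler turns the surrogates back into bytes.
restored = text.encode('utf-8', 'surrogateescape')

# Every possible byte value survives the same round trip.
all_bytes = bytes(range(256))
survives = (
    all_bytes.decode('utf-8', 'surrogateescape')
             .encode('utf-8', 'surrogateescape') == all_bytes
)

print(restored == raw, survives)
```

The intermediate str is not safe to encode with the default strict handler, so this approach suits data that only passes through the program.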

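UPDATE 20170116: Thanks to a comment by Nearoo, there is also the possibility of slash-escaping all unknown bytes with the backslashreplace error handler. That works only in Python 3, so even with this workaround you will still get inconsistent output from different Python versions:

import sys

PY3K = sys.version_info >= (3, 0)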

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

See Python's Unicode Support for details.
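As a quick illustration of what the backslashreplace handler produces when decoding (this needs Python 3.5 or later, where the handler gained decoding support; the sample bytes are arbitrary):

```python
raw = b'\x80abc'

# The undecodable byte 0x80 is replaced by the four characters \x80;
# the valid ASCII bytes that follow are decoded normally.
escaped = raw.decode('utf-8', 'backslashreplace')

print(escaped)  # prints: \x80abc
```

Unlike surrogateescape, the result is ordinary printable text, but a backslash that was already present in the input can no longer be told apart from an escaped byte.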

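UPDATE 20170119: I decided to implement a slash-escaping decoder that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.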

# --- preparation

import codecs

def slashescape(err):
    """Codecs error handler. err is a UnicodeDecodeError instance. Return
    a tuple with a replacement for the undecodable part of the input
    and the position where decoding should continue."""
    # err.start:err.end can span more than one byte (for example a truncated
    # sequence), so escape each byte; bytearray yields ints on Python 2 and 3.
    thebytes = bytearray(err.object[err.start:err.end])
    repl = u''.join(u'\\x%02x' % b for b in thebytes)
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))

2014-12-17 14:23:09

If you get this error:

'utf-8' codec can't decode byte 0x8a,

then it is better to use the following code to convert bytes to a string:

data = b"abcdefg"  # named "data" so the built-in bytes type is not shadowed
text = data.decode("utf-8", "ignore")  # undecodable bytes are silently dropped
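To make the effect of the "ignore" handler visible, here is a small comparison with the "replace" handler on input that actually contains the offending byte 0x8a (the sample bytes are invented for illustration):

```python
data = b'abc\x8adef'

# "ignore" drops the undecodable byte without leaving any trace.
ignored = data.decode('utf-8', 'ignore')

# "replace" substitutes U+FFFD, so the damage stays visible in the text.
replaced = data.decode('utf-8', 'replace')

print(ignored)          # prints: abcdef
print(ascii(replaced))  # prints: 'abc\ufffddef'
```

Both handlers lose information, so they are a reasonable choice only when the dropped bytes genuinely do not matter.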

2021-10-21 06:36:44
