Convert bytes to a string

I captured the standard output of an external program into a bytes object:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>>
>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

I want to convert it to a normal Python string so that I can print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

How do I convert a bytes object to a str with Python 3?

Current answer

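Since this question is actually about subprocess output, a more direct approach is available. The most modern one is to use subprocess.check_output and pass text=True (Python 3.7+), which automatically decodes stdout using the system default encoding: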

import subprocess
text = subprocess.check_output(["ls", "-l"], text=True)

For Python 3.6, Popen accepts an encoding keyword:

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt

If you're not dealing with subprocess output, the general answer to the question in the title is to decode the bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, sys.getdefaultencoding() is used (this is 'utf-8' on Python 3). If your data is not in that encoding, you must specify the encoding explicitly in the decode call:

>>> b'caf\xe9'.decode('cp1250')
'café'

2018-05-31 17:52:19

Other answers

Decode the bytes object to produce a string:

>>> b"abcde".decode("utf-8")
'abcde'

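The example above assumes that the bytes object is in UTF-8, because it is a common encoding. However, you should use the encoding your data is actually in!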

2009-03-03 12:26:18

Set universal_newlines to True, i.e.:

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]

2014-01-21 15:31:09

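For the specific case of "run a shell command and get its output as text instead of bytes", on Python 3.7+ you should use subprocess.run and pass in text=True (as well as capture_output=True to capture the output):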

command_result = subprocess.run(["ls", "-l"], capture_output=True, text=True)
command_result.stdout  # is a `str` containing your program's stdout

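text used to be called universal_newlines, and was changed (well, aliased) in Python 3.7. If you want to support Python versions before 3.7, pass in universal_newlines=True instead of text=True. Note that capture_output was also added in 3.7, so on older versions pass stdout=subprocess.PIPE instead.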
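As a rough sketch of the pre-3.7 spelling described above, the snippet below runs the current interpreter (sys.executable) as a portable stand-in for an external command such as ls. The child command and the variable names are illustrative assumptions, not part of the original answer.

```python
import subprocess
import sys

# Run the current interpreter as a stand-in for an external command.
result = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,      # capture_output=True on Python 3.7+
    universal_newlines=True,     # spelled text=True on Python 3.7+
)

# stdout is already a str, so no explicit decode() call is needed.
print(type(result.stdout).__name__, repr(result.stdout))
```

The same call keeps working on 3.7 and later, because text is only an alias for universal_newlines.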

2019-08-07 14:15:31

While @Aaron Maenpaa's answer just works, a user recently asked:

Is there any simpler way? 'fhand.read().decode("ASCII")' […] It's so long!

You can use:

command_stdout.decode()

decode() has default arguments:

codecs.decode(obj, encoding='utf-8', errors='strict')
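To make those defaults concrete, here is a small sketch showing that the short and long spellings agree, and what the errors argument changes when the input is not valid UTF-8. The sample byte strings and variable names are assumptions chosen for illustration.

```python
import codecs

data = "µ".encode("utf-8")  # b'\xc2\xb5'

# All three calls rely on the same defaults: encoding='utf-8', errors='strict'.
same = data.decode() == data.decode("utf-8", "strict") == codecs.decode(data)

# With input that is not valid UTF-8, errors= selects what happens instead of raising.
broken = b"caf\xe9"
replaced = broken.decode("utf-8", errors="replace")          # U+FFFD marks the bad byte
escaped = broken.decode("utf-8", errors="backslashreplace")  # keeps the byte as \xe9 text

print(same, ascii(replaced), ascii(escaped))
```

Both non-strict handlers avoid the exception but do not recover the intended text, so they are best kept for logging or diagnostics.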

2015-11-13 10:24:21

To interpret a byte sequence as text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/' and zero b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

Trying to decode such byte soup using the utf-8 encoding raises UnicodeDecodeError.
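A minimal sketch of that failure, built from the same byte sequence as the filename example but without creating a file. The variable names are illustrative.

```python
# Every byte value except NUL and '/', as in the filename example.
name = bytes(range(0x100)).translate(None, b"\0/")

error = None
try:
    name.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    # exc.start is the offset of the first byte that could not be decoded.
    print(f"cannot decode byte 0x{name[exc.start]:02x} at position {exc.start}: {exc.reason}")
```

The exception carries the offending offset, which can help when reporting which part of the output was not valid text.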

It can be worse. Decoding may fail silently and produce mojibake if you use a wrong, incompatible encoding:

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

The data is corrupted, but your program remains unaware that a failure has occurred.

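In general, the character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. Some outcomes are more likely than others, which is why the chardet module exists: it can guess the character encoding. A single Python script may use multiple character encodings in different places.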

ls output can be converted to a Python string using the os.fsdecode() function, which succeeds even for undecodable filenames (on Unix it uses sys.getfilesystemencoding() and the surrogateescape error handler):

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes back, you can use os.fsencode().
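The round trip can be sketched as follows, assuming a Unix-like system where the surrogateescape handler is in effect. The filename is hypothetical and no file is created.

```python
import os

raw = b"caf\xe9\xff.txt"       # hypothetical filename that is not valid UTF-8
name = os.fsdecode(raw)        # undecodable bytes become lone surrogates (Unix)
restored = os.fsencode(name)   # reverses the conversion exactly

# ascii() is used because printing lone surrogates directly can raise UnicodeEncodeError.
print(ascii(name), restored == raw)
```

Because the conversion is lossless, the resulting str can be passed back to functions such as open() or os.stat() even though it is not printable as-is.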

If you pass the universal_newlines=True parameter, subprocess uses locale.getpreferredencoding(False) to decode bytes; for example, it can be cp1252 on Windows.

To decode the byte stream on the fly, io.TextIOWrapper() can be used.
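A minimal sketch of on-the-fly decoding, using io.BytesIO as a stand-in for a binary stream such as a subprocess pipe. The sample data and variable names are assumptions for illustration.

```python
import io

# Stand-in for a binary stream (for example a pipe, or a file opened with 'rb').
raw_stream = io.BytesIO("first line\ncaf\u00e9 \u00b5\n".encode("utf-8"))

# The wrapper decodes incrementally as text is read, line by line here.
text_stream = io.TextIOWrapper(raw_stream, encoding="utf-8")
lines = [line.rstrip("\n") for line in text_stream]

print(lines)
```

With a real process, the same wrapper could be placed around the binary stdout of a Popen object, so output is decoded as it arrives rather than after the command finishes.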

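Different commands may use different character encodings for their output; for example, the dir internal command (cmd) may use cp437. To decode its output, you can pass the encoding explicitly (Python 3.6+):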

output = subprocess.check_output('dir', shell=True, encoding='cp437')

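The filenames may differ from those returned by os.listdir() (which uses the Windows Unicode API). For example, '\xb6' can be substituted with '\x14': Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see the question "Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string".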

2016-11-16 09:43:26
