Convert bytes to a string

I captured the standard output of an external program into a bytes object:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>>
>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

I want to convert that to a normal Python string, so that I can print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

How do I convert the bytes object to a str with Python 3?

Decode the bytes object to produce a string:

>>> b"abcde".decode("utf-8") 
'abcde'

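The above example assumes that the bytes object is in UTF-8, because it is a common encoding. However, you should use the encoding your data is actually in!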
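To see why the encoding matters, here is a small sketch. The sample bytes are made up for illustration and do not come from the question. The same bytes decode to different text depending on the codec, and the wrong guess does not necessarily raise an error.

```python
# Illustrative data (an assumption, not from the question): "café" encoded as UTF-8.
raw = b"caf\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # the encoding the data is really in
as_latin1 = raw.decode("latin-1")  # wrong guess: no error, but wrong text

print(as_utf8)
print(as_latin1)
```

The second result is one character longer than the first, because each of the two UTF-8 bytes was read as a separate Latin-1 character.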

2009-03-03 12:26:18

Decode the byte string and turn it into a character (Unicode) string.

Python 3:

encoding = 'utf-8'
b'hello'.decode(encoding)

str(b'hello', encoding)

Python 2:

encoding = 'utf-8'
'hello'.decode(encoding)

unicode('hello', encoding)

2009-03-03 12:28:31

I think you actually want this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

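Aaron's answer was correct, except that you need to know which encoding to use. I believe that Windows uses "Windows-1252". It only matters if you have some unusual (non-ASCII) characters in your content, but then it makes a difference.

By the way, the fact that it does matter is the reason Python moved to using two different types for binary and text data: it can't convert magically between them, because it doesn't know the encoding unless you tell it! The only way you would know is to read the Windows documentation (or read it here).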

2011-07-18 19:51:15

This joins a list of byte values into a string:

>>> bytes_data = [112, 52, 52]
>>> "".join(map(chr, bytes_data))
'p44'
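If the list really holds byte values (integers from 0 to 255), a possible alternative is to build a bytes object first and then decode it. This is only a sketch, and it assumes the values are ASCII-encoded text. Note that chr() treats each integer as a Unicode code point, so the two approaches differ for values above 127.

```python
bytes_data = [112, 52, 52]

# bytes() accepts an iterable of integers in range(256).
text = bytes(bytes_data).decode("ascii")
print(text)
```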

2012-08-22 12:57:08

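From the sys documentation (System-specific parameters and functions):

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').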

2014-01-11 07:15:18

Set universal_newlines to True, i.e.

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]
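A self-contained way to try this is sketched below. As an assumption for portability, it runs the current Python interpreter as the child process instead of ls, so the printed lines are stand-ins for real ls output.

```python
import sys
from subprocess import PIPE, Popen

# Assumption for portability: run the current Python interpreter as the child
# process instead of `ls`, so the sketch works on any platform.
child = [sys.executable, "-c", "print('file1'); print('file2')"]

command_stdout = Popen(child, stdout=PIPE, universal_newlines=True).communicate()[0]

print(type(command_stdout))
print(command_stdout.splitlines())
```

With universal_newlines=True the result is a str decoded with the locale's preferred encoding, and line endings are normalized to \n.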

2014-01-21 15:31:09

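If you don't know the encoding, then to read binary input into a string in a way that is compatible with both Python 3 and Python 2, use the ancient MS-DOS CP437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)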

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

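Because the encoding is unknown, expect non-English symbols to be translated to characters of cp437 (English characters are not translated, because they match in most single-byte encodings and in UTF-8).

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this: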

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte

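The same applies to latin-1, which was popular (the default?) in Python 2. See the missing points in Codepage Layout; that is where Python chokes with the infamous "ordinal not in range" error.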

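UPDATE 20150604: There are rumors that Python 3 has the surrogateescape error strategy for encoding data into binary form without data loss or crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.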
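A minimal version of that [binary] -> [str] -> [binary] round trip can be sketched as follows, reusing the sample bytes from the traceback above. It only checks that the bytes survive; it says nothing about performance.

```python
raw = b"\x00\x01\xffsd"

# Undecodable bytes become lone surrogate code points instead of raising.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")

print(restored == raw)
```

The intermediate string contains a lone surrogate, so encoding it with the default strict handler (for example when printing it) can raise UnicodeEncodeError.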

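UPDATE 20170116: Thanks to a comment by Nearoo, there is also the possibility of backslash-escaping all unknown bytes with the backslashreplace error handler. That works only in Python 3, so even with this workaround you will still get inconsistent output from different Python versions:

import sys

PY3K = sys.version_info >= (3, 0)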

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

See Python's Unicode Support for details.
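For a concrete picture of what backslashreplace produces when decoding (supported for decoding since Python 3.5), here is a short sketch with a made-up input line:

```python
line = b"\x80abc"  # illustrative input: 0x80 is not a valid UTF-8 start byte

escaped = line.decode("utf-8", "backslashreplace")
print(escaped)
```

The invalid byte is kept in the text as the four characters backslash, x, 8, 0, so no information is lost, but the result is no longer the original byte.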

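UPDATE 20170119: I decided to implement a slash-escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.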

# --- preparation

import codecs

def slashescape(err):
    """ codecs error handler. err is UnicodeDecode instance. return
    a tuple with a replacement for the unencodable part of the input
    and a position where encoding should continue"""
    #print err, dir(err), err.start, err.end, err.object[:err.start]
    thebyte = err.object[err.start:err.end]
    repl = u'\\x'+hex(ord(thebyte))[2:]
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))

2014-12-17 14:23:09

While @Aaron Maenpaa's answer just works, a user recently asked:

Is there any simpler way? 'fhand.read().decode("ASCII")' [...] It's so long!

You can use:

command_stdout.decode()

decode() has default arguments:

codecs.decode(obj, encoding='utf-8', errors='strict')

2015-11-13 10:24:21

In Python 3, the default encoding is "utf-8", so you can directly use:

b'hello'.decode()

which is equivalent to

b'hello'.decode(encoding="utf-8")

On the other hand, in Python 2, the encoding defaults to the default string encoding. Thus, you should use:

b'hello'.decode(encoding)

where encoding is the encoding you want.

Note: support for keyword arguments was added in Python 2.7.

2016-06-29 14:21:21

To interpret a byte sequence as text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/' and zero b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

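Trying to decode such byte soup using the utf-8 encoding raises UnicodeDecodeError.

It can be worse. The decoding may fail silently and produce mojibake if you use a wrong, incompatible encoding: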

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

The data is corrupted, but your program remains unaware that a failure has occurred.
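When you happen to know which wrong codec was applied, and that wrong decode did not lose any bytes, the damage can sometimes be reversed. The sketch below illustrates this for the single character used above; it is not a general repair tool.

```python
original = "\u2014"  # EM DASH, written as an escape to keep the source ASCII

mojibake = original.encode("utf-8").decode("cp1252")

# Undo the wrong decode, then decode with the right codec.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(repaired == original)
```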

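In general, which character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. Some outcomes are more likely than others, and therefore the chardet module exists, which can guess the character encoding. A single Python script may use multiple character encodings in different places.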
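As a rough illustration of guessing, the sketch below tries a list of candidate encodings in order. This is a much simpler technique than what chardet does, and the helper name decode_with_fallbacks and the candidate list are hypothetical choices made for this example.

```python
def decode_with_fallbacks(data, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never raises,
    # but the result may still be mojibake.
    return data.decode("latin-1"), "latin-1"

print(decode_with_fallbacks(b"caf\xc3\xa9"))
print(decode_with_fallbacks(b"caf\xe9"))
```

A clean decode only means the bytes are valid in that encoding, not that the guess is right, so an out-of-band declaration of the encoding is still preferable.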

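The ls output can be converted to a Python string using the os.fsdecode() function, which succeeds even for undecodable filenames (on Unix it uses sys.getfilesystemencoding() and the surrogateescape error handler):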

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes, you could use os.fsencode().
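The round trip can be sketched like this. The file name is hypothetical, and the check assumes a Unix system, where the surrogateescape handler is in effect.

```python
import os

# Hypothetical file name containing a byte (0xff) that is not valid UTF-8.
raw_name = b"caf\xff.txt"

name = os.fsdecode(raw_name)       # str, possibly containing lone surrogates
round_tripped = os.fsencode(name)  # back to the original bytes

print(round_tripped == raw_name)
```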

If you pass the universal_newlines=True parameter, then subprocess uses locale.getpreferredencoding(False) to decode the bytes; for example, it can be cp1252 on Windows.

To decode the byte stream on the fly, io.TextIOWrapper() can be used.
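A minimal sketch of on-the-fly decoding follows. An in-memory io.BytesIO stands in for a real binary stream such as a pipe, which is an assumption made so the example runs anywhere.

```python
import io

# Stand-in for a binary stream such as a pipe (an assumption for illustration).
binary_stream = io.BytesIO("line one\nline two: \u00b5\n".encode("utf-8"))

text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
lines = [line.rstrip("\n") for line in text_stream]

print(lines)
```

The wrapper decodes as it reads, so the whole input never has to be held in memory as bytes first.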

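Different commands may use different character encodings for their output; for example, the dir internal command (cmd) may use cp437. To decode its output, you could pass the encoding explicitly (Python 3.6+):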

output = subprocess.check_output('dir', shell=True, encoding='cp437')

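The filenames may differ from those returned by os.listdir() (which uses the Windows Unicode API). For example, '\xb6' can be substituted with '\x14': Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see "Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string".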

2016-11-16 09:43:26

For Python 3, this is a much safer and more Pythonic approach to converting from bytes to a string:

def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes): # Check if it's in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n')

Output:

total 0
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

2017-01-18 07:21:09

If you get the following when trying decode():

AttributeError: 'str' object has no attribute 'decode'

You can also specify the encoding type directly in a cast:

>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'
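That AttributeError means the value is already a str, and str(value, 'utf-8') would raise a TypeError for it as well. One way to accept both types is sketched below; the helper name ensure_text is hypothetical.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as str, decoding it only if it is bytes-like."""
    if isinstance(value, (bytes, bytearray)):
        return str(value, encoding)
    if isinstance(value, str):
        return value
    raise TypeError("expected bytes or str, got " + type(value).__name__)

print(ensure_text(b"Hello World"))
print(ensure_text("Hello World"))
```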

2017-11-22 04:20:55

When working with data from Windows systems (with \r\n line endings), my answer is

String = Bytes.decode("utf-8").replace("\r\n", "\n")

Why? Try this with a multiline Input.txt:

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)

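When this runs on Windows, all your line endings will be doubled (to \r\r\n), leading to extra empty lines. Python's text-read functions usually normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python does not have a chance to do that. Thus,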

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)

will replicate your original file.
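Another option, sketched here as an alternative rather than as part of the answer above, is to leave the decoded text untouched and turn off newline translation when writing by passing newline="" to open(). The sample bytes and the temporary output path are assumptions for the example.

```python
import os
import tempfile

data = b"first\r\nsecond\r\n"  # bytes as they might arrive from Windows
text = data.decode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "Output.txt")
# newline="" disables newline translation on write.
with open(path, "w", encoding="utf-8", newline="") as outfile:
    outfile.write(text)

with open(path, "rb") as infile:
    written = infile.read()

print(written == data)
```

This keeps the \r\n sequences in the in-memory string, so the replace() approach is still the better fit when the text is processed further in Python.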

2018-03-16 13:28:25

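Since this question is actually asking about subprocess output, you have more direct approaches available. The most modern is to use subprocess.check_output and pass text=True (Python 3.7+) to automatically decode stdout using the system default encoding: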

text = subprocess.check_output(["ls", "-l"], text=True)

For Python 3.6, Popen accepts an encoding keyword:

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt

The general answer to the question in the title, if you're not dealing with subprocess output, is to decode the bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, sys.getdefaultencoding() will be used. If your data is not in sys.getdefaultencoding(), then you must specify the encoding explicitly in the decode call:

>>> b'caf\xe9'.decode('cp1250')
'café'

2018-05-31 17:52:19

def toString(string):
    try:
        return string.decode("utf-8")
    except (AttributeError, ValueError):
        # str has no decode(); undecodable bytes raise UnicodeDecodeError
        return string

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))

2018-06-03 22:44:45

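If you want to convert any bytes, not just a string converted to bytes:

import base64
import json

with open("bytesfile", "rb") as infile:
    # b85encode returns bytes; decode them as ASCII to get a str
    b85_text = base64.b85encode(infile.read()).decode("ascii")

with open("bytesfile", "rb") as infile:
    json_text = json.dumps(list(infile.read()))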

This is not very efficient, however. It will turn a 2 MB picture into 9 MB.
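To get a feel for the overhead of each representation, the sketch below measures both on a small in-memory sample instead of a file. The sample data is arbitrary and chosen only for illustration.

```python
import base64
import json

data = bytes(range(256)) * 4  # 1024 arbitrary bytes (illustrative data)

b85_text = base64.b85encode(data).decode("ascii")
json_text = json.dumps(list(data))

print(len(data), len(b85_text), len(json_text))

# Both representations can be turned back into the original bytes.
print(base64.b85decode(b85_text) == data)
print(bytes(json.loads(json_text)) == data)
```

Base85 produces five characters for every four bytes, while the JSON list spends several characters per byte on digits and separators.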

2019-06-01 02:30:56

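For your specific case of "run a shell command and get its output as text instead of bytes", on Python 3.7 and later you should use subprocess.run and pass in text=True (as well as capture_output=True to capture the output):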

command_result = subprocess.run(["ls", "-l"], capture_output=True, text=True)
command_result.stdout  # is a `str` containing your program's stdout

text used to be called universal_newlines, and was changed (well, aliased) in Python 3.7. If you want to support Python versions before 3.7, pass in universal_newlines=True instead of text=True.
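The two spellings can be compared side by side in a small sketch. As an assumption for portability, the child process is the current Python interpreter rather than ls. The second call uses stdout=subprocess.PIPE because capture_output was also added in Python 3.7.

```python
import subprocess
import sys

# Assumption for portability: the child is the current Python interpreter
# rather than `ls`.
child = [sys.executable, "-c", "print('hello from child')"]

with_text = subprocess.run(child, capture_output=True, text=True)
with_alias = subprocess.run(
    child, stdout=subprocess.PIPE, universal_newlines=True
)

print(repr(with_text.stdout))
print(with_text.stdout == with_alias.stdout)
```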

2019-08-07 14:15:31

Try this:

bytes.fromhex('c3a9').decode('utf-8')

2020-01-19 08:19:02

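Try using this one. This function ignores any bytes that are not valid in the chosen character set (UTF-8 by default) and returns a clean string. It is tested for Python 3.6 and above.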

def bin2str(text, encoding = 'utf-8'):
    """Converts a binary to Unicode string by removing all non Unicode char
    text: binary string to work on
    encoding: output encoding *utf-8"""

    return text.decode(encoding, 'ignore')

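Here, the function takes the binary data and decodes it: it converts the bytes to characters using the given character set, the 'ignore' argument drops any bytes that are not valid in that character set, and the function finally returns the resulting string.

If you are not sure about the encoding, use sys.getdefaultencoding() to get the default encoding of your device.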

2021-05-18 19:07:58

If you have had this error:

'utf-8' codec can't decode byte 0x8a,

then it is better to use the following code to convert bytes to a string:

bytes = b"abcdefg"
string = bytes.decode("utf-8", "ignore")
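Keep in mind that "ignore" silently discards data. The sketch below uses a made-up input with one invalid byte to compare it with the "replace" handler, which leaves a visible marker instead.

```python
data = b"caf\xe9 ok"  # 0xe9 is not valid UTF-8 here (illustrative data)

ignored = data.decode("utf-8", "ignore")    # drops the bad byte
replaced = data.decode("utf-8", "replace")  # inserts U+FFFD in its place

print(repr(ignored))
print(repr(replaced))
```

If the byte was meant to be a character in some other encoding, neither handler recovers it; decoding with the correct encoding is the only way to get the original text.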

2021-10-21 06:36:44

We can decode the bytes object to produce a string using bytes.decode(encoding='utf-8', errors='strict'). For documentation, see bytes.decode.

Python 3 example:

byte_value = b"abcde"
print("Initial value = {}".format(byte_value))
print("Initial value type = {}".format(type(byte_value)))
string_value = byte_value.decode("utf-8")
# utf-8 is used here because it is a very common encoding, but you need to use the encoding your data is actually in.
print("------------")
print("Converted value = {}".format(string_value))
print("Converted value type = {}".format(type(string_value)))

Output:

Initial value = b'abcde'
Initial value type = <class 'bytes'>
------------
Converted value = abcde
Converted value type = <class 'str'>

Note: In Python 3, the default encoding is UTF-8. So <byte_string>.decode("utf-8") can also be written as <byte_string>.decode()

2022-02-23 12:52:03

Bytes

m=b'This is bytes'

Convert to a string

Method 1

m.decode("utf-8")

m.decode()

Method 2

import codecs
codecs.decode(m,encoding="utf-8")

import codecs
codecs.decode(m)

Method 3

str(m,encoding="utf-8")

str(m)[2:-1]

Result

'This is bytes'
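The str(m)[2:-1] variant deserves a caveat: it slices the printable representation of the bytes object rather than decoding it, so it only matches the other methods for plain printable ASCII. A short sketch with a made-up value containing a newline shows the difference:

```python
m = b"line 1\nline 2"

sliced = str(m)[2:-1]        # slices the repr: the newline stays as backslash + n
decoded = m.decode("utf-8")  # real decoding: the newline is a real newline

print(len(sliced), len(decoded))
```

The sliced version is one character longer, because the single newline byte became two characters.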

2022-06-21 13:18:28
