In the following Python example, I encode a string as Base64:

>>> import base64
>>> encoded = base64.b64encode(b'data to be encoded')
>>> encoded
b'ZGF0YSB0byBiZSBlbmNvZGVk'

But if I leave out the leading b:

>>> encoded = base64.b64encode('data to be encoded')

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\base64.py", line 56, in b64encode
    raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str

Why does this happen?


Current answer

If the string is Unicode, the simplest way is:

import base64

a = base64.b64encode(bytes(u'complex string: ñáéíóúÑ', "utf-8"))

# a: b'Y29tcGxleCBzdHJpbmc6IMOxw6HDqcOtw7PDusOR'

b = base64.b64decode(a).decode("utf-8", "ignore")

print(b)
# b: complex string: ñáéíóúÑ

Other answers

Short answer

You need to pass a bytes-like object (bytes, bytearray, etc.) to base64.b64encode(). Here are two ways:

>>> import base64
>>> data = base64.b64encode(b'data to be encoded')
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'

Or with a variable:

>>> import base64
>>> string = 'data to be encoded'
>>> data = base64.b64encode(string.encode())
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'

Why?

In Python 3, str objects are not C-style character arrays (so they are not byte arrays); they are sequences of Unicode characters with no inherent byte encoding. You can encode such a string in a variety of ways. The most common (and the default for .encode() in Python 3) is UTF-8, especially since it is backwards compatible with ASCII (although so are most widely used encodings). That is what happens when you call the .encode() method on a string: Python encodes it as UTF-8 (the default) and gives you the corresponding sequence of bytes.
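As a small illustration of that point (my own sketch, not part of the original session), here is the same one-character string turned into bytes under three different encodings. The variable names are just placeholders.

```python
text = 'ñ'

utf8 = text.encode()              # no argument means UTF-8
latin1 = text.encode('latin-1')
utf16 = text.encode('utf-16-le')

print(utf8)    # b'\xc3\xb1'
print(latin1)  # b'\xf1'
print(utf16)   # b'\xf1\x00'

# ASCII-only text comes out identical in UTF-8 and ASCII
print('data'.encode() == 'data'.encode('ascii'))  # True
```

The string is the same in every case; only the bytes that represent it change, which is why b64encode() asks you to make that choice first.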

Base-64 encoding in Python

The question was originally about Base-64 encoding. Read on for more about Base-64.

base64 encoding takes 6-bit binary chunks and encodes them using the characters A-Z, a-z, 0-9, '+', '/', and '=' (some encodings use different characters in place of '+' and '/'). This is a character encoding that is based off of the mathematical construct of radix-64 or base-64 number system, but they are very different. Base-64 in math is a number system like binary or decimal, and you do this change of radix on the entire number, or (if the radix you're converting from is a power of 2 less than 64) in chunks from right to left.

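In base64 encoding, the conversion is done from left to right; those first 64 characters are why it is called base64 encoding. The 65th symbol, '=', is used for padding: the encoding pulls out 6-bit chunks, but the data it is usually meant to encode are 8-bit bytes, so sometimes there are only two or four bits in the last chunk.

Example: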

>>> data = b'test'
>>> for byte in data:
...     print(format(byte, '08b'), end=" ")
...
01110100 01100101 01110011 01110100
>>>

If you interpret that binary data as a single integer, this is how you would convert it to base-10 and base-64 (table for base-64):

base-2:  01 110100 011001 010111 001101 110100 (base-64 grouping shown)
base-10:                            1952805748
base-64:  B      0      Z      X      N      0

base64 encoding, however, will re-group this data:

base-2:  011101  000110  010101 110011 011101 00(0000) <- pad w/zeros to make a clean 6-bit chunk
base-10:     29       6      21     51     29      0
base-64:      d       G       V      z      d      A

So, 'B0ZXN0' is the base-64 version of our binary, mathematically speaking. However, base64 encoding has to do the encoding in the opposite direction (so the raw data is converted to 'dGVzdA') and also has a rule to tell other applications how much space is left off at the end. This is done by padding the end with '=' symbols. So, the base64 encoding of this data is 'dGVzdA==', with two '=' symbols to signify two pairs of bits will need to be removed from the end when this data gets decoded to make it match the original data.

Let's test it to see if I am being dishonest:

>>> encoded = base64.b64encode(data)
>>> print(encoded)
b'dGVzdA=='
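If you want to reproduce both groupings yourself, here is a rough sketch. The helper names radix64 and manual_b64encode are made up for this illustration; they are not part of the base64 module, and the real implementation does not work on bit strings like this.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def radix64(data: bytes) -> str:
    """Treat the bytes as one big integer and write it in base 64 (right to left)."""
    n = int.from_bytes(data, "big")
    digits = ""
    while n:
        n, rem = divmod(n, 64)
        digits = ALPHABET[rem] + digits
    return digits or ALPHABET[0]

def manual_b64encode(data: bytes) -> str:
    """Regroup the bits left to right into 6-bit chunks, then pad with '='."""
    bits = "".join(format(byte, "08b") for byte in data)
    bits += "0" * (-len(bits) % 6)   # zero-fill the last chunk
    chars = "".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    return chars + "=" * (-len(chars) % 4)

print(radix64(b"test"))           # B0ZXN0
print(manual_b64encode(b"test"))  # dGVzdA==
```

The second helper should agree with base64.b64encode() for any input, since it follows the same left-to-right regrouping and padding rule.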

Why use base64 encoding?

Let's say I have to send some data to someone via email, like this data:

>>> data = b'\x04\x6d\x73\x67\x08\x08\x08\x20\x20\x20'
>>> print(data.decode())
   
>>> print(data)
b'\x04msg\x08\x08\x08   '
>>>

There are two problems I planted:

1. If I tried to send that email in Unix, the email would send as soon as the \x04 character was read, because that is ASCII for END-OF-TRANSMISSION (Ctrl-D), so the remaining data would be left out of the transmission.
2. Also, while Python is smart enough to escape all of my evil control characters when I print the data directly, when that string is decoded as ASCII you can see that the 'msg' is not there. That is because I used three BACKSPACE characters and three SPACE characters to erase the 'msg'. Thus, even if I didn't have the EOT character there, the end user wouldn't be able to translate from the text on screen to the real, raw data.

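This is just a demo to show you how hard it can be to simply send raw data. Encoding the data into base64 format gives you the exact same data, but in a format that ensures it is safe for sending over electronic media such as email.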
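To make that concrete, here is a short sketch of my own that encodes the same awkward bytes and checks that nothing is lost on the way back:

```python
import base64

data = b'\x04\x6d\x73\x67\x08\x08\x08\x20\x20\x20'

encoded = base64.b64encode(data)
print(encoded)   # b'BG1zZwgICCAgIA=='

# every byte of the encoded form is a printable ASCII character
print(all(32 < byte < 127 for byte in encoded))   # True

# decoding restores the original bytes, control characters included
print(base64.b64decode(encoded) == data)   # True
```

The encoded form is longer than the input, which is the price paid for sticking to a safe alphabet.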

If the data to be encoded contains "exotic" characters, I think you have to encode it as "UTF-8".

encoded = base64.b64encode(bytes('data to be encoded', "utf-8"))

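base64 encoding takes 8-bit binary byte data and encodes it using only the characters A-Z, a-z, 0-9, +, /* so it can be transmitted over channels that do not preserve all 8 bits of data, such as email.

Hence, it wants a string of 8-bit bytes. You create those in Python 3 with the b'' syntax.

If you remove the b, it becomes a string. A string is a sequence of Unicode characters. base64 has no idea what to do with Unicode data; it is not 8-bit. It is not really any bits, in fact. :-)

In your second example: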

>>> encoded = base64.b64encode('data to be encoded')

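All the characters fit neatly into the ASCII character set, so base64 encoding is actually a bit pointless here. You can convert the string to ASCII bytes instead, with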

>>> encoded = 'data to be encoded'.encode('ascii')

Or simpler:

>>> encoded = b'data to be encoded'

Which would be the same thing in this case.
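Here is a quick way to check both claims, as a sketch. Note that newer Python 3 releases word the TypeError differently from the traceback in the question, so I print whatever message comes back rather than quoting it.

```python
import base64

text = 'data to be encoded'

# A str is rejected; the exact message varies between Python 3 versions.
try:
    base64.b64encode(text)
except TypeError as exc:
    print("TypeError:", exc)

# For pure-ASCII text, encoding to bytes gives the same value as the b'' literal.
as_bytes = text.encode('ascii')
print(as_bytes == b'data to be encoded')   # True
print(base64.b64encode(as_bytes))          # b'ZGF0YSB0byBiZSBlbmNvZGVk'
```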


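* Most base64 flavours may also include a = at the end as padding. In addition, some base64 variants may use characters other than + and /. See the variants summary table at Wikipedia for an overview.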
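One such variant ships with the standard library: base64.urlsafe_b64encode() swaps + and / for - and _. A tiny sketch, using two bytes I picked because they happen to produce both of the affected characters:

```python
import base64

raw = b'\xfb\xff'

standard = base64.b64encode(raw)
urlsafe = base64.urlsafe_b64encode(raw)

print(standard)  # b'+/8='
print(urlsafe)   # b'-_8='

# both variants decode back to the same bytes
print(base64.urlsafe_b64decode(urlsafe) == base64.b64decode(standard))  # True
```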

Here is all you need:

expected bytes, not str

The leading b makes your string binary.

What version of Python do you use? 2.x or 3.x?

Edit: See http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit for the gory details of strings in Python 3.x