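The JSON format has no native support for binary data. Binary data must be escaped so that it can be placed in a string element of JSON (that is, zero or more Unicode characters in double quotes, using backslash escapes).
An obvious way to escape binary data is Base64. However, Base64 has a high processing overhead. It also expands 3 bytes into 4 characters, which increases the data size by about 33%.
One use case for this is the v0.8 draft of the CDMI Cloud Storage API specification. You create data objects through a REST web service using JSON, for example: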
PUT /MyContainer/BinaryObject HTTP/1.1
Host: cloud.example.com
Accept: application/vnd.org.snia.cdmi.dataobject+json
Content-Type: application/vnd.org.snia.cdmi.dataobject+json
X-CDMI-Specification-Version: 1.0
{
"mimetype" : "application/octet-stream",
"metadata" : [ ],
"value" : "TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlz
IHNpbmd1bGFyIHBhc3Npb24gZnJvbSBvdGhlciBhbmltYWxzLCB3aGljaCBpcyBhIGx1c3Qgb2Yg
dGhlIG1pbmQsIHRoYXQgYnkgYSBwZXJzZXZlcmFuY2Ugb2YgZGVsaWdodCBpbiB0aGUgY29udGlu
dWVkIGFuZCBpbmRlZmF0aWdhYmxlIGdlbmVyYXRpb24gb2Yga25vd2xlZGdlLCBleGNlZWRzIHRo
ZSBzaG9ydCB2ZWhlbWVuY2Ugb2YgYW55IGNhcm5hbCBwbGVhc3VyZS4="
}
Is there a better, standard way to encode binary data into JSON strings?
While it is true that Base64 has a ~33% expansion rate, it is not necessarily true that its processing overhead is significantly more than this: it really depends on the JSON library or toolkit you are using. Encoding and decoding are simple, straightforward operations, and they can even be optimized with respect to character encoding (JSON only supports UTF-8/16/32), because Base64 characters are always single-byte in a UTF-8 encoded JSON string.
For example, on the Java platform there are libraries that do the job rather efficiently, so the overhead is mostly due to the expanded size.
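To illustrate how small the encode/decode step can be, here is a minimal sketch. It assumes Node.js, whose global `Buffer` provides Base64 conversion; the helper names `toJsonValue` and `fromJsonValue` and the sample bytes are made up for this example and do not come from any specification.

```javascript
// Sketch only: assumes Node.js (global Buffer). Helper names are hypothetical.

// Encode raw bytes as a Base64 string that is safe inside a JSON string.
function toJsonValue(bytes) {
  return Buffer.from(bytes).toString("base64");
}

// Decode the Base64 string back into the original bytes.
function fromJsonValue(text) {
  return new Uint8Array(Buffer.from(text, "base64"));
}

// Sample bytes, including values that could not appear raw in a JSON string.
const sampleBytes = Uint8Array.from([0x00, 0xff, 0x22, 0x5c, 0x0a, 0x80]);

const jsonBody = JSON.stringify({
  mimetype: "application/octet-stream",
  value: toJsonValue(sampleBytes),
});

const roundTripped = fromJsonValue(JSON.parse(jsonBody).value);
console.log(jsonBody);
console.log(Buffer.from(roundTripped).equals(Buffer.from(sampleBytes))); // true
```

The 6 input bytes become 8 Base64 characters, which is the 4/3 expansion discussed above.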
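I agree with two earlier answers:
Base64 is a simple, commonly used standard, so it is unlikely that you will find something better specifically for use with JSON (Base85 is used by PostScript and others, but its benefits are at best marginal when you think about it).
Compression before encoding (and decompression after decoding) may make a lot of sense, depending on the data you use.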
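The compress-then-encode idea could look roughly like the sketch below. It assumes Node.js and its built-in `zlib` module; the helper names `pack` and `unpack` and the repetitive sample data are invented for illustration, and the actual gain depends entirely on how compressible your data is.

```javascript
// Sketch only: assumes Node.js (CommonJS) and its built-in zlib module.
const zlib = require("zlib");

// Compress first, then Base64-encode the compressed bytes.
function pack(bytes) {
  return zlib.deflateSync(Buffer.from(bytes)).toString("base64");
}

// Base64-decode first, then decompress.
function unpack(text) {
  return new Uint8Array(zlib.inflateSync(Buffer.from(text, "base64")));
}

// Highly repetitive sample input, chosen so that compression clearly helps.
const repetitive = new TextEncoder().encode("binary data in JSON ".repeat(200));

const plainBase64 = Buffer.from(repetitive).toString("base64");
const packedBase64 = pack(repetitive);

console.log(plainBase64.length, packedBase64.length);
console.log(Buffer.from(unpack(packedBase64)).equals(Buffer.from(repetitive))); // true
```

For data that is already compressed (images, archives), this step is unlikely to help and may even add a few bytes.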
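The problem with UTF-8 is that it is not the most space-efficient encoding. Also, some random binary byte sequences are invalid UTF-8, so you cannot simply interpret a random binary byte sequence as UTF-8 data. The benefit of this constraint is that it makes UTF-8 robust: from any byte we start looking at, it is possible to locate where multi-byte characters begin and end.
As a consequence, while a byte value in the range [0..127] needs only one byte in UTF-8, a byte value in the range [128..255] needs 2 bytes!
It gets worse. In JSON, control characters, " and \ are not allowed to appear unescaped in a string, so the binary data would need some transformation to be properly encoded.
Let's see. If we assume uniformly distributed random byte values in our binary data, then on average half of the bytes are encoded in one byte and the other half in two bytes. The UTF-8 encoded binary data would be 150% of the initial size.
Base64 encoding grows only to 133% of the initial size, so Base64 encoding is more efficient.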
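The 150% versus 133% comparison can be checked with a small sketch. It assumes Node.js (global `Buffer` and `TextEncoder`) and uses one of every possible byte value as a stand-in for uniformly distributed data; mapping each byte to the character with the same code point is the naive approach described above, not a recommended encoding.

```javascript
// Sketch only: assumes Node.js (global Buffer and TextEncoder).

// One of every possible byte value stands in for uniformly distributed data.
const allByteValues = Uint8Array.from({ length: 256 }, (_, i) => i);

// Naive approach: treat each byte as the code point of one character.
const asCharacters = String.fromCharCode(...allByteValues);
const utf8Length = new TextEncoder().encode(asCharacters).length;

// JSON.stringify additionally escapes control characters, " and \.
const jsonLength = new TextEncoder().encode(JSON.stringify(asCharacters)).length;

const base64Length = Buffer.from(allByteValues).toString("base64").length;

console.log(utf8Length / 256);        // 1.5
console.log(base64Length / 256);      // 1.34375
console.log(jsonLength > utf8Length); // true
```

The Base64 figure is slightly above 4/3 here only because 256 is not a multiple of 3, so one padded group is added.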
What about using another base encoding? In UTF-8, the 128 ASCII values are the most space-efficient to encode: each 8-bit byte can carry 7 bits. So if we cut the binary data into 7-bit chunks and store each chunk in one byte of a UTF-8 encoded string, the encoded data would grow only to 114% of the initial size. Better than Base64. Unfortunately we can't use this easy trick because JSON doesn't allow some ASCII characters. The 33 control characters of ASCII ([0..31] and 127) and the " and \ must be excluded. This leaves us only 128-35 = 93 characters.
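So in theory we could define a Base93 encoding that would grow the encoded size to 8/log2(93) = 8*log10(2)/log10(93) = 122%. But a Base93 encoding would not be as convenient as Base64. Base64 only requires cutting the input byte sequence into 6-bit chunks, for which simple bitwise operations work well. Besides, 133% is not much more than 122%.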
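The arithmetic above can be reproduced with a few lines. This is only a sketch of the calculation: the function name `expansionFactor` is made up, and the set of excluded characters simply follows the counting used in this answer.

```javascript
// Expansion factor when each output byte carries log2(alphabetSize) bits.
function expansionFactor(alphabetSize) {
  return 8 / Math.log2(alphabetSize);
}

// Count the ASCII characters left after the exclusions listed in the answer:
// control characters [0..31], 127, the double quote and the backslash.
let usableChars = 0;
for (let code = 0; code < 128; code++) {
  const ch = String.fromCharCode(code);
  if (code > 31 && code !== 127 && ch !== '"' && ch !== "\\") usableChars++;
}

console.log(usableChars); // 93
for (const size of [64, 93, 128]) {
  // Prints 133%, 122% and 114% respectively.
  console.log(size, Math.round(expansionFactor(size) * 100) + "%");
}
```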
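This is why I independently came to the common conclusion that Base64 is indeed the best choice for encoding binary data in JSON. My answer presents a justification for it. I agree that it isn't very attractive from the performance point of view, but consider also the benefit of using JSON, with its human-readable string representation that is easy to manipulate in all programming languages.
If performance is critical, then a pure binary encoding should be considered as a replacement for JSON. But with JSON, my conclusion is that Base64 is the best.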
Digging deeper
I dug a little bit more (while implementing base128) and found that when we send characters whose codes are greater than 127, the browser (Chrome) in fact sends TWO bytes instead of one :(. The reason is that JSON by default uses UTF-8, in which characters with codes above 127 are encoded in two bytes, as mentioned in chmike's answer. I tested it this way: type chrome://net-export/ in the Chrome URL bar, select "Include raw bytes", start capturing, send the POST requests (using the snippet at the bottom), stop capturing and save the JSON file with the raw request data. Then we look inside that JSON file:
We can find our base64 request by searching for the string 4142434445464748494a4b4c4d4e, which is the hex encoding of ABCDEFGHIJKLMN, and we will see "byte_count": 639 for it.
We can find our above127 request by searching for the string C2BCC2BDC380C381C382C383C384C385C386C387C388C389C38AC38B, which is the hex of the UTF-8 bytes sent in the request for the characters ¼½ÀÁÂÃÄÅÆÇÈÉÊË (whereas the code points of these characters in hex are bcbdc0c1c2c3c4c5c6c7c8c9cacb). Here "byte_count": 703, so it is 64 bytes longer than the base64 request, because characters with codes above 127 are encoded in 2 bytes in the request :(
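The hex strings and the 64-byte difference can be reproduced without a network capture. The sketch below assumes a runtime with a global `TextEncoder` (browsers or Node.js); the helper name `utf8Hex` is made up, and the second string is rebuilt from its code points so that the example does not depend on the source file's encoding.

```javascript
// Sketch only: assumes a runtime with a global TextEncoder.
const utf8Encoder = new TextEncoder();

// Hex dump of the UTF-8 bytes of a string.
function utf8Hex(text) {
  return Array.from(utf8Encoder.encode(text), (b) =>
    b.toString(16).padStart(2, "0")
  ).join("");
}

// The two payloads from the snippet: the Base64 alphabet and
// 64 characters with codes above 127 (0xBC, 0xBD, then 0xC0..0xFD).
const base64Chars =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
const highCodes = [0xbc, 0xbd];
for (let code = 0xc0; code <= 0xfd; code++) highCodes.push(code);
const above127Chars = String.fromCharCode(...highCodes);

console.log(utf8Hex(base64Chars.slice(0, 14)));
// 4142434445464748494a4b4c4d4e
console.log(utf8Hex(above127Chars.slice(0, 14)));
// c2bcc2bdc380c381c382c383c384c385c386c387c388c389c38ac38b
console.log(
  utf8Encoder.encode(above127Chars).length -
    utf8Encoder.encode(base64Chars).length
); // 64
```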
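So in fact there is no benefit in sending characters with codes >127. For base64 strings we do not observe this negative behaviour (probably the same holds for base85; I did not check it). However, one solution to this problem may be to send the data in the binary part of a POST multipart/form-data request, as described in Ælex's answer (though usually in that case we don't need to use any base encoding at all...).
An alternative approach may rely on mapping a two-byte portion of data onto one valid UTF-8 character using something like base65280 / base65k, but it would probably be less effective than base64 due to the UTF-8 specification...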
// Sends the 64 Base64 alphabet characters (one byte each in UTF-8).
function postBase64() {
  const formData = new FormData();
  const req = new XMLHttpRequest();
  formData.append("base64ch", "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/");
  req.open("POST", "/testBase64ch");
  req.send(formData);
}
// Sends 64 characters with codes above 127 (two bytes each in UTF-8).
function postAbove127() {
  const formData = new FormData();
  const req = new XMLHttpRequest();
  formData.append("above127", "¼½ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüý");
  req.open("POST", "/testAbove127");
  req.send(formData);
}
<button onclick="postBase64()">POST base64 chars</button>
<button onclick="postAbove127()">POST chars with codes > 127</button>
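For comparison, the multipart/form-data alternative mentioned above might be sketched as follows. This assumes a runtime with global `FormData` and `Blob` (browsers, or Node.js 18 and later); the function name `buildBinaryForm`, the field name, the file name and the `/testBinary` endpoint are all hypothetical.

```javascript
// Sketch only: assumes global FormData and Blob (browsers, Node.js 18+).
// Field name, file name and endpoint are hypothetical.
function buildBinaryForm(bytes) {
  const form = new FormData();
  // A Blob part carries the raw bytes, with no Base64 or UTF-8 expansion.
  form.append(
    "payload",
    new Blob([bytes], { type: "application/octet-stream" }),
    "payload.bin"
  );
  return form;
}

const rawBytes = Uint8Array.from([0xbc, 0xbd, 0xc0, 0xc1]);
const binaryForm = buildBinaryForm(rawBytes);
console.log(binaryForm.get("payload").size); // 4

// In a browser the form could then be sent like the other requests:
// const req = new XMLHttpRequest();
// req.open("POST", "/testBinary"); // hypothetical endpoint
// req.send(binaryForm);
```

The four bytes stay four bytes in the form part, whereas the same values sent as text would take eight bytes in UTF-8.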