使用Javascript的atob解码base64不能正确解码utf-8字符串

我使用Javascript window.atob()函数来解码base64编码的字符串(特别是来自GitHub API的base64编码的内容)。问题是我得到ascii编码的字符返回(如âⅱ而不是™)。我如何正确地处理传入的base64编码流，以便将其解码为utf-8?

当前回答

解码base64到UTF8字符串

以下是@brandonscript目前投票最多的答案

function b64DecodeUnicode(str) {
    // Going backwards: from bytestream, to percent-encoding, to original string.
    return decodeURIComponent(atob(str).split('').map(function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}

上面的代码可以工作，但是非常慢。如果您的输入是一个非常大的base64字符串，例如，对于一个base64 html文档，30,000个字符。这需要大量的计算。

这是我的答案，使用内置的TextDecoder，比上面的大输入代码快近10倍。

function decodeBase64(base64) {
    const text = atob(base64);
    const length = text.length;
    const bytes = new Uint8Array(length);
    for (let i = 0; i < length; i++) {
        bytes[i] = text.charCodeAt(i);
    }
    const decoder = new TextDecoder(); // default is utf-8
    return decoder.decode(bytes);
}

2020-11-09 13:09:48

其他回答

统一码问题

虽然JavaScript (ECMAScript)已经成熟了，但是Base64、ASCII和Unicode编码的脆弱性已经造成了很多令人头痛的问题(在这个问题的历史中有很多)。

考虑下面的例子:

const ok = "a";
console.log(ok.codePointAt(0).toString(16)); //   61: occupies < 1 byte

const notOK = "✓"
console.log(notOK.codePointAt(0).toString(16)); // 2713: occupies > 1 byte

console.log(btoa(ok));    // YQ==
console.log(btoa(notOK)); // error

为什么我们会遇到这种情况?

根据设计，Base64期望二进制数据作为输入。就JavaScript字符串而言，这意味着每个字符只占用一个字节的字符串。因此，如果您向btoa()传递一个包含占用超过一个字节的字符的字符串，您将得到一个错误，因为这不是二进制数据。

来源:MDN (2021)

最初的MDN文章还介绍了窗口的破碎特性。btoa和.atob，在现代ECMAScript中已被修复。最初的，现已死亡的MDN文章解释道:

“统一码问题” 由于domstring是16位编码的字符串，在大多数浏览器中调用window。如果字符超出8位字节(0x00~0xFF)的范围，Unicode字符串上的btoa将导致字符超出范围异常。

具有二进制互操作性的解决方案

(继续滚动查找ASCII base64解决方案)

来源:MDN (2021)

MDN推荐的解决方案是实际对二进制字符串表示进行编码:

编码UTF8⇢二进制

// convert a Unicode string to a string in which
// each 16-bit unit occupies only one byte
function toBinary(string) {
  const codeUnits = new Uint16Array(string.length);
  for (let i = 0; i < codeUnits.length; i++) {
    codeUnits[i] = string.charCodeAt(i);
  }
  return btoa(String.fromCharCode(...new Uint8Array(codeUnits.buffer)));
}

// a string that contains characters occupying > 1 byte
let encoded = toBinary("✓ à la mode") // "EycgAOAAIABsAGEAIABtAG8AZABlAA=="

解码二进制⇢UTF-8

function fromBinary(encoded) {
  const binary = atob(encoded);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return String.fromCharCode(...new Uint16Array(bytes.buffer));
}

// our previous Base64-encoded string
let decoded = fromBinary(encoded) // "✓ à la mode"

这有点失败，您会注意到编码的字符串EycgAOAAIABsAGEAIABtAG8AZABlAA==不再匹配前一个解决方案的字符串4pyTIMOgIGxhIG1vZGU=。这是因为它是二进制编码的字符串，而不是UTF-8编码的字符串。如果这对您来说无关紧要(即，您没有转换来自另一个系统的以UTF-8表示的字符串)，那么您就可以开始了。但是，如果希望保留UTF-8功能，最好使用下面描述的解决方案。

解决方案与ASCII base64互操作

这个问题的整个历史表明，多年来我们有多少种不同的方法来解决破碎的编码系统。虽然最初的MDN文章已经不存在了，但这个解决方案仍然可以说是一个更好的解决方案，并且在解决“Unicode问题”方面做得很好，同时保持了可以在base64decode.org上解码的纯文本base64字符串。

解决这个问题有两种可能的方法:

第一个是转义整个字符串(使用UTF-8，参见encodeURIComponent)，然后对它进行编码; 第二步是将UTF-16 DOMString转换为UTF-8字符数组，然后对其进行编码。

关于以前的解决方案的注意事项:MDN文章最初建议使用unescape和转义来解决字符超出范围异常问题，但它们已被弃用。这里的一些其他答案建议使用decodeURIComponent和encodeURIComponent来解决这个问题，这已经被证明是不可靠和不可预测的。这个答案的最新更新使用了现代JavaScript函数来提高速度和现代化代码。

如果你想节省自己的时间，你也可以考虑使用库:

js-base64 (NPM，非常适合Node.js) base64-js

编码UTF8⇢base64

    function b64EncodeUnicode(str) {
        // first we use encodeURIComponent to get percent-encoded UTF-8,
        // then we convert the percent encodings into raw bytes which
        // can be fed into btoa.
        return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g,
            function toSolidBytes(match, p1) {
                return String.fromCharCode('0x' + p1);
        }));
    }
    
    b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
    b64EncodeUnicode('\n'); // "Cg=="

解码base64⇢UTF8

    function b64DecodeUnicode(str) {
        // Going backwards: from bytestream, to percent-encoding, to original string.
        return decodeURIComponent(atob(str).split('').map(function(c) {
            return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
        }).join(''));
    }
    
    b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"
    b64DecodeUnicode('Cg=='); // "\n"

(我们为什么要这样做?('00' + c. charcodeat (0). tostring (16)).slice(-2)将0前置到单个字符串，例如当c == \n时，c. charcodeat (0). tostring(16)返回a，迫使a表示为0a)。

打印稿的支持

下面是相同的解决方案，但增加了一些TypeScript兼容性(通过@MA-Maddin):

// Encoding UTF8 ⇢ base64

function b64EncodeUnicode(str) {
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) {
        return String.fromCharCode(parseInt(p1, 16))
    }))
}

// Decoding base64 ⇢ UTF8

function b64DecodeUnicode(str) {
    return decodeURIComponent(Array.prototype.map.call(atob(str), function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2)
    }).join(''))
}

第一个解决方案(已弃用)

使用escape和unescape(现在已弃用，但在所有现代浏览器中仍然有效):

function utf8_to_b64( str ) {
    return window.btoa(unescape(encodeURIComponent( str )));
}

function b64_to_utf8( str ) {
    return decodeURIComponent(escape(window.atob( str )));
}

// Usage:
utf8_to_b64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64_to_utf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"

最后一件事:我第一次遇到这个问题是在调用GitHub API时。为了让它在(移动)Safari上正常工作，我实际上不得不在解码源代码之前从base64源代码中剥离所有空白。我不知道这在2021年是否仍有意义:

function b64_to_utf8( str ) {
    str = str.replace(/\s/g, '');    
    return decodeURIComponent(escape(window.atob( str )));
}

2015-05-07 16:16:58

这是我的一行程序解决方案，结合了Jackie Hans的答案和另一个问题的一些代码:

const utf8_encoded_text = new TextDecoder().decode(Uint8Array.from(window.atob(base_64_decoded_text).split("").map(x => x.charCodeAt(0))));

2022-11-15 16:43:57

小修正，unescape和escape已弃用，因此:

function utf8_to_b64( str ) {
    return window.btoa(decodeURIComponent(encodeURIComponent(str)));
}

function b64_to_utf8( str ) {
     return decodeURIComponent(encodeURIComponent(window.atob(str)));
}


function b64_to_utf8( str ) {
    str = str.replace(/\s/g, '');    
    return decodeURIComponent(encodeURIComponent(window.atob(str)));
}

2015-12-08 11:32:58

包括上述解决方案，如果仍然面临问题，尝试如下，考虑转义不支持TS的情况。

blob = new Blob(["\ufeff", csv_content]); // this will make symbols to appears in excel

对于csv_content，您可以像下面这样尝试。

function b64DecodeUnicode(str: any) {        
        return decodeURIComponent(atob(str).split('').map((c: any) => {
            return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
        }).join(''));
    }

2018-08-08 09:47:30

I would assume that one might want a solution that produces a widely useable base64 URI. Please visit data:text/plain;charset=utf-8;base64,4pi44pi54pi64pi74pi84pi+4pi/ to see a demonstration (copy the data uri, open a new tab, paste the data URI into the address bar, then press enter to go to the page). Despite the fact that this URI is base64-encoded, the browser is still able to recognize the high code points and decode them properly. The minified encoder+decoder is 1058 bytes (+Gzip→589 bytes)

!function(e){"use strict";function h(b){var a=b.charCodeAt(0);if(55296<=a&&56319>=a)if(b=b.charCodeAt(1),b===b&&56320<=b&&57343>=b){if(a=1024*(a-55296)+b-56320+65536,65535<a)return d(240|a>>>18,128|a>>>12&63,128|a>>>6&63,128|a&63)}else return d(239,191,189);return 127>=a?inputString:2047>=a?d(192|a>>>6,128|a&63):d(224|a>>>12,128|a>>>6&63,128|a&63)}function k(b){var a=b.charCodeAt(0)<<24,f=l(~a),c=0,e=b.length,g="";if(5>f&&e>=f){a=a<<f>>>24+f;for(c=1;c<f;++c)a=a<<6|b.charCodeAt(c)&63;65535>=a?g+=d(a):1114111>=a?(a-=65536,g+=d((a>>10)+55296,(a&1023)+56320)):c=0}for(;c<e;++c)g+="\ufffd";return g}var m=Math.log,n=Math.LN2,l=Math.clz32||function(b){return 31-m(b>>>0)/n|0},d=String.fromCharCode,p=atob,q=btoa;e.btoaUTF8=function(b,a){return q((a?"\u00ef\u00bb\u00bf":"")+b.replace(/[\x80-\uD7ff\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]?/g,h))};e.atobUTF8=function(b,a){a||"\u00ef\u00bb\u00bf"!==b.substring(0,3)||(b=b.substring(3));return p(b).replace(/[\xc0-\xff][\x80-\xbf]*/g,k)}}(""+void 0==typeof global?""+void 0==typeof self?this:self:global)

下面是用于生成它的源代码。

var fromCharCode = String.fromCharCode;
var btoaUTF8 = (function(btoa, replacer){"use strict";
    return function(inputString, BOMit){
        return btoa((BOMit ? "\xEF\xBB\xBF" : "") + inputString.replace(
            /[\x80-\uD7ff\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]?/g, replacer
        ));
    }
})(btoa, function(nonAsciiChars){"use strict";
    // make the UTF string into a binary UTF-8 encoded string
    var point = nonAsciiChars.charCodeAt(0);
    if (point >= 0xD800 && point <= 0xDBFF) {
        var nextcode = nonAsciiChars.charCodeAt(1);
        if (nextcode !== nextcode) // NaN because string is 1 code point long
            return fromCharCode(0xef/*11101111*/, 0xbf/*10111111*/, 0xbd/*10111101*/);
        // https://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
        if (nextcode >= 0xDC00 && nextcode <= 0xDFFF) {
            point = (point - 0xD800) * 0x400 + nextcode - 0xDC00 + 0x10000;
            if (point > 0xffff)
                return fromCharCode(
                    (0x1e/*0b11110*/<<3) | (point>>>18),
                    (0x2/*0b10*/<<6) | ((point>>>12)&0x3f/*0b00111111*/),
                    (0x2/*0b10*/<<6) | ((point>>>6)&0x3f/*0b00111111*/),
                    (0x2/*0b10*/<<6) | (point&0x3f/*0b00111111*/)
                );
        } else return fromCharCode(0xef, 0xbf, 0xbd);
    }
    if (point <= 0x007f) return nonAsciiChars;
    else if (point <= 0x07ff) {
        return fromCharCode((0x6<<5)|(point>>>6), (0x2<<6)|(point&0x3f));
    } else return fromCharCode(
        (0xe/*0b1110*/<<4) | (point>>>12),
        (0x2/*0b10*/<<6) | ((point>>>6)&0x3f/*0b00111111*/),
        (0x2/*0b10*/<<6) | (point&0x3f/*0b00111111*/)
    );
});

然后，要解码base64数据，HTTP可以将数据作为数据URI获取，也可以使用下面的函数。

var clz32 = Math.clz32 || (function(log, LN2){"use strict";
    return function(x) {return 31 - log(x >>> 0) / LN2 | 0};
})(Math.log, Math.LN2);
var fromCharCode = String.fromCharCode;
var atobUTF8 = (function(atob, replacer){"use strict";
    return function(inputString, keepBOM){
        inputString = atob(inputString);
        if (!keepBOM && inputString.substring(0,3) === "\xEF\xBB\xBF")
            inputString = inputString.substring(3); // eradicate UTF-8 BOM
        // 0xc0 => 0b11000000; 0xff => 0b11111111; 0xc0-0xff => 0b11xxxxxx
        // 0x80 => 0b10000000; 0xbf => 0b10111111; 0x80-0xbf => 0b10xxxxxx
        return inputString.replace(/[\xc0-\xff][\x80-\xbf]*/g, replacer);
    }
})(atob, function(encoded){"use strict";
    var codePoint = encoded.charCodeAt(0) << 24;
    var leadingOnes = clz32(~codePoint);
    var endPos = 0, stringLen = encoded.length;
    var result = "";
    if (leadingOnes < 5 && stringLen >= leadingOnes) {
        codePoint = (codePoint<<leadingOnes)>>>(24+leadingOnes);
        for (endPos = 1; endPos < leadingOnes; ++endPos)
            codePoint = (codePoint<<6) | (encoded.charCodeAt(endPos)&0x3f/*0b00111111*/);
        if (codePoint <= 0xFFFF) { // BMP code point
          result += fromCharCode(codePoint);
        } else if (codePoint <= 0x10FFFF) {
          // https://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
          codePoint -= 0x10000;
          result += fromCharCode(
            (codePoint >> 10) + 0xD800,  // highSurrogate
            (codePoint & 0x3ff) + 0xDC00 // lowSurrogate
          );
        } else endPos = 0; // to fill it in with INVALIDs
    }
    for (; endPos < stringLen; ++endPos) result += "\ufffd"; // replacement character
    return result;
});

更标准的优点是这个编码器和这个解码器更广泛地适用，因为它们可以用作正确显示的有效URL。观察。

In addition to being very standardized, the above code snippets are also very fast. Instead of an indirect chain of succession where the data has to be converted several times between various forms (such as in Riccardo Galli's response), the above code snippet is as direct as performantly possible. It uses only one simple fast String.prototype.replace call to process the data when encoding, and only one to decode the data when decoding. Another plus is that (especially for big strings), String.prototype.replace allows the browser to automatically handle the underlying memory management of resizing the string, leading a significant performance boost especially in evergreen browsers like Chrome and Firefox that heavily optimize String.prototype.replace. Finally, the icing on the cake is that for you latin script exclūsīvō users, strings which don't contain any code points above 0x7f are extra fast to process because the string remains unmodified by the replacement algorithm.

我已经在https://github.com/anonyco/BestBase64EncoderDecoder/为这个解决方案创建了一个github存储库

2018-11-22 14:52:26

使用Javascript的atob解码base64不能正确解码utf-8字符串

推荐文章

最新文章

标签