How many bytes does a Unicode character take?

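I am a bit confused about encodings. As far as I know, old ASCII characters took one byte per character. How many bytes does a Unicode character require?

I assume that one Unicode character can contain every possible character from any language; am I correct? So how many bytes does it need per character?

What do things like UTF-7, UTF-6, UTF-16 etc. mean? Are they different versions of Unicode?

I read the Wikipedia article about Unicode, but it is quite difficult for me. I am looking forward to seeing a simple answer.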

Current answer

There is a great tool for calculating the number of bytes of any string in UTF-8: http://mothereff.in/byte-counter

Update: @mathias has made the code public: https://github.com/mathiasbynens/mothereff.in/blob/master/byte-counter/eff.js
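If you just want the number without opening the site, the same count is easy to get locally. This is only a minimal Python sketch and not the code of the linked tool; the function name `utf8_byte_count` is made up for illustration.

```python
def utf8_byte_count(text: str) -> int:
    """Return how many bytes `text` occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Escapes are used so the samples are unambiguous code points.
samples = {
    "LATIN SMALL LETTER A": "a",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "GRINNING FACE": "\U0001F600",
}

for name, char in samples.items():
    print(f"{name}: {utf8_byte_count(char)} byte(s)")
```

The four samples come out as 1, 2, 3 and 4 bytes respectively.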

2013-08-02 21:21:20

Other answers

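Well, I just pulled up the Wikipedia page too, and in the intro I saw: "Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 (which uses one byte for any ASCII characters, which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters), the now-obsolete UCS-2 (which uses two bytes for each character but cannot encode every character in the current Unicode standard)".

As this quote demonstrates, your problem is that you are assuming Unicode is a single way of encoding characters. There are actually multiple encodings of Unicode, and, as the quote says, one of them (UTF-8) even uses one byte per character for ASCII text, just like you are used to.

So the simple answer you want is that it varies.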

2011-03-13 15:09:46

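You won't see a simple answer, because there isn't one.

First, Unicode doesn't contain "every character from every language", although it sure does try.

Unicode itself is a mapping. It defines code points, and a code point is a number that is usually associated with a character. I say usually because there are concepts like combining characters. You may be familiar with things like accents or umlauts. They can be used with another character, such as a or u, to create a new logical character. A character can therefore consist of one or more code points.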

To be useful in computing systems we need to choose a representation for this information. Those are the various Unicode encodings, such as UTF-8, UTF-16LE, UTF-32 etc. They are distinguished largely by the size of their code units. UTF-32 is the simplest encoding: its code unit is 32 bits, which means an individual code point fits comfortably into one code unit. The other encodings have situations where a code point needs multiple code units, or where a particular code point can't be represented in the encoding at all (this is a problem with UCS-2, for instance).
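To make the code unit idea concrete, here is a small Python sketch that counts code points and code units for the same text. The helper name `code_unit_counts` is hypothetical; the little-endian codec names are used only so that Python does not prepend a byte order mark to the output.

```python
def code_unit_counts(text: str) -> dict:
    """Count code points and the code units each encoding needs for `text`."""
    return {
        "code points": len(text),  # a Python 3 str is a sequence of code points
        "UTF-8 units (8 bit)": len(text.encode("utf-8")),
        "UTF-16 units (16 bit)": len(text.encode("utf-16-le")) // 2,
        "UTF-32 units (32 bit)": len(text.encode("utf-32-le")) // 4,
    }


for sample in ["A", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(sample):04X}", code_unit_counts(sample))
```

Only in UTF-32 does the number of code units always equal the number of code points.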

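Because of the flexibility of combining characters, even within a given encoding the number of bytes per character can vary depending on the character and the normalization form. Normalization is a convention for dealing with characters that have more than one representation (you can write "an 'a' with an accent" as two code points, one of which is a combining character, or as a single code point).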
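A quick way to see this is Python's standard `unicodedata` module. The sketch below takes "a with grave accent" in both representations and compares their sizes; the variable names are just for illustration.

```python
import unicodedata

composed = "\u00e0"     # one code point: LATIN SMALL LETTER A WITH GRAVE
decomposed = "a\u0300"  # two code points: 'a' + COMBINING GRAVE ACCENT

# NFC composes where possible, NFD decomposes.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

for label, text in [("composed (NFC)", composed), ("decomposed (NFD)", decomposed)]:
    print(
        label,
        "code points:", len(text),
        "UTF-8 bytes:", len(text.encode("utf-8")),
        "UTF-16 bytes:", len(text.encode("utf-16-le")),
    )
```

Both strings render as the same character, yet the composed form is 2 bytes in UTF-8 and the decomposed form is 3.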

2011-03-13 15:19:29

In UTF-8:

1 byte:       0 -     7F     (ASCII)
2 bytes:     80 -    7FF     (all European plus some Middle Eastern)
3 bytes:    800 -   FFFF     (multilingual plane incl. the top 1792 and private-use)
4 bytes:  10000 - 10FFFF

In UTF-16:

2 bytes:      0 -   FFFF     (multilingual plane; D800 - DFFF are surrogates, not characters)
4 bytes:  10000 - 10FFFF

In UTF-32:

4 bytes:      0 - 10FFFF
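The ranges can be turned into a small lookup function. This is a sketch in Python with a made-up name, `encoded_lengths`. It assumes that in UTF-16 every code point up to FFFF outside the surrogate range D800 - DFFF takes a single 2-byte unit, and it cross-checks itself against Python's own codecs.

```python
def encoded_lengths(code_point: int) -> tuple:
    """Return (UTF-8, UTF-16, UTF-32) byte counts for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range 0..10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not encodable characters")

    if code_point <= 0x7F:
        utf8 = 1
    elif code_point <= 0x7FF:
        utf8 = 2
    elif code_point <= 0xFFFF:
        utf8 = 3
    else:
        utf8 = 4

    utf16 = 2 if code_point <= 0xFFFF else 4
    utf32 = 4
    return utf8, utf16, utf32


# Cross-check the range boundaries against the built-in codecs.
for cp in [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF]:
    ch = chr(cp)
    actual = tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))
    assert encoded_lengths(cp) == actual
    print(f"U+{cp:04X}", actual)
```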

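By definition, 10FFFF is the last Unicode code point, and it is defined that way because it is UTF-16's technical limit.

Standard UTF-8 stops at the same code point, so it never needs more than 4 bytes (the 4-byte bit pattern itself could hold values up to 1FFFFF). The idea behind UTF-8's encoding also works for 5- and 6-byte sequences, which would cover code points up to 7FFFFFFF, i.e. half of what the 32 bits of UTF-32 can hold.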
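The arithmetic behind those limits is short. In the UTF-8 bit layout, a lead byte of an n-byte sequence (n >= 2) spends n + 1 bits on its length marker and each continuation byte carries 6 payload bits. The sketch below (Python, with the hypothetical name `max_code_point`) just adds those up; it describes the bit pattern, including the 5- and 6-byte forms of the original design, not what today's standard permits.

```python
def max_code_point(n_bytes: int) -> int:
    """Largest value the UTF-8 bit pattern can hold in `n_bytes` bytes."""
    if n_bytes == 1:
        payload_bits = 7                       # 0xxxxxxx
    elif 2 <= n_bytes <= 6:
        lead_bits = 7 - n_bytes                # e.g. 110xxxxx leaves 5 bits
        payload_bits = lead_bits + 6 * (n_bytes - 1)
    else:
        raise ValueError("the scheme is defined for 1 to 6 bytes")
    return (1 << payload_bits) - 1


for n in range(1, 7):
    print(n, "byte(s): up to", format(max_code_point(n), "X"))
```

This gives 7F, 7FF, FFFF, 1FFFFF, 3FFFFFF and 7FFFFFFF. Modern UTF-8 caps the 4-byte form at 10FFFF and does not allow the 5- and 6-byte forms.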

2016-08-27 12:18:10

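In Unicode the answer is not easily given. The problem, as you already pointed out, is the encodings.

Given any English sentence without diacritic characters, the answer for UTF-8 would be as many bytes as characters, and for UTF-16 it would be the number of characters times two.

The only encoding where (as of now) we can make a statement about the size is UTF-32. There it is always 32 bits per character, even though I imagine that code points are prepared for a future UTF-64 :)

What makes it so difficult are at least two things: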

1. Composed characters, where instead of using the character that is already accented (À), a user decides to combine the accent and the base character (`A).
2. Code points. Code points are the method by which the UTF encodings can encode more than the number of bits in their name would usually allow. E.g. UTF-8 designates certain bytes which on their own are invalid, but which, when followed by valid continuation bytes, describe a character beyond the 8-bit range of 0..255. See the Examples and Overlong Encodings sections in the Wikipedia article on UTF-8.

The excellent example given there is that the € character (code point U+20AC) can be represented either as the three-byte sequence E2 82 AC or as the four-byte sequence F0 82 82 AC. Both are valid, and this shows how complicated the answer is when talking about "Unicode" and not about a specific encoding of Unicode, such as UTF-8 or UTF-16.

Strictly speaking, as pointed out in a comment, this doesn't seem to be the case any longer, or was even based on a misunderstanding on my part. The quote from the updated Wikipedia article reads: "Longer encodings are called overlong and are not valid UTF-8 representations of the code point."
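You can check the overlong case yourself. The sketch below uses Python's built-in UTF-8 decoder, which is strict about this; `is_valid_utf8` is just an illustrative helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


shortest = b"\xe2\x82\xac"      # the three-byte form of U+20AC
overlong = b"\xf0\x82\x82\xac"  # the same bits padded into four bytes

print("E2 82 AC    valid:", is_valid_utf8(shortest))
print("F0 82 82 AC valid:", is_valid_utf8(overlong))
print("encoder output:", "\u20ac".encode("utf-8").hex(" "))
```

The three-byte sequence is accepted and the four-byte one is rejected, which matches the corrected Wikipedia wording: each code point has exactly one valid UTF-8 encoding, the shortest one.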

2011-03-13 15:10:34

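For UTF-16, a character needs four bytes (two code units) if it starts with a code unit in the range 0xD800 to 0xDBFF; such a character is encoded as a "surrogate pair". More specifically, a surrogate pair has the form:

[0xD800 - 0xDBFF]  [0xDC00 - 0xDFFF]

where [...] indicates a two-byte code unit in the given range. Any value <= 0xD7FF or >= 0xE000 is a single code unit (two bytes).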

See http://unicodebook.readthedocs.io/unicode_encodings.html, section 7.5.
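For illustration, this is how a code point above FFFF is split into the two code units of a surrogate pair. It is a Python sketch of the standard UTF-16 arithmetic; the function name `to_surrogate_pair` is made up here, and the result is checked against Python's own UTF-16 encoder.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point in 10000..10FFFF into (high, low) surrogate units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above FFFF use surrogate pairs")
    offset = code_point - 0x10000         # a 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits    -> D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> DC00..DFFF
    return high, low


cp = 0x1F600  # GRINNING FACE
high, low = to_surrogate_pair(cp)
print(f"U+{cp:X} -> {high:04X} {low:04X}")

# Compare with the big-endian UTF-16 bytes produced by the built-in codec.
encoded = chr(cp).encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

For U+1F600 this yields D83D DE00, i.e. four bytes in total.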

2016-07-12 20:45:30
