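I'm a bit confused about encodings. As far as I know, the old ASCII characters took one byte per character. How many bytes does a Unicode character require?
I assume that one Unicode character can contain every possible character from any language. Am I correct? So how many bytes does it need per character?
And what do things like UTF-7, UTF-8, UTF-16 etc. mean? Are they different versions of Unicode?
I read the Wikipedia article about Unicode, but it is quite difficult for me. I am looking forward to seeing a simple answer.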
Current answer
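With Unicode the answer is not easily given. The problem, as you already pointed out, is the encoding.
For any English sentence without diacritic characters, UTF-8 needs as many bytes as there are characters, and UTF-16 needs the number of characters times two.
The only encoding for which (as of now) we can state a fixed size is UTF-32: every code point always takes 32 bits, even though I imagine that code points are prepared for a future UTF-64 :)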
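As a rough illustration of the sizes described above, here is a minimal Python sketch. The sample sentence and the helper name `encoded_size` are made up for this example, and the `-le` codec variants are used so that no byte order mark is added to the count.

```python
# Compare how many bytes a plain-ASCII English sentence occupies
# in three Unicode encodings.
sentence = "The quick brown fox"  # illustrative sample, ASCII only


def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


chars = len(sentence)
print("characters:", chars)
print("UTF-8 bytes:", encoded_size(sentence, "utf-8"))       # same as chars
print("UTF-16 bytes:", encoded_size(sentence, "utf-16-le"))  # chars * 2
print("UTF-32 bytes:", encoded_size(sentence, "utf-32-le"))  # chars * 4
```

The one-to-one and times-two relations only hold because the sentence is pure ASCII; they break as soon as other characters appear.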
At least two things make it so difficult:
Composed characters: instead of using the character that is already accented (À), a user may decide to combine the accent and the base character (`A).
Code points: code points are how the UTF encodings can encode more values than the number of bits in their name would usually allow. For example, UTF-8 designates certain bytes which are invalid on their own but which, when followed by valid continuation bytes, describe a character beyond the 8-bit range 0..255. See the Examples and Overlong Encodings sections of the Wikipedia article on UTF-8.
The example given there is the € character (code point U+20AC). I originally wrote that it can be represented either as the three-byte sequence E2 82 AC or as the four-byte sequence F0 82 82 AC, and that both are valid. As pointed out in a comment, that is no longer the case, or was based on a misunderstanding on my part. The updated Wikipedia article reads: "Longer encodings are called overlong and are not valid UTF-8 representations of the code point." Only E2 82 AC is valid. Even so, this shows how complicated the answer is when talking about "Unicode" rather than about a specific encoding of Unicode, such as UTF-8 or UTF-16.
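To check the overlong-encoding point for the € example, the following sketch asks Python's built-in UTF-8 codec to decode both byte sequences. The helper name `is_valid_utf8` is hypothetical, and the result only tells us how this particular decoder behaves.

```python
# The euro sign and the two byte sequences discussed in the answer.
euro = "\u20ac"
shortest = euro.encode("utf-8")        # the three bytes E2 82 AC
overlong = bytes.fromhex("F08282AC")   # the four-byte overlong form


def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts `data`."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(shortest.hex(" "), is_valid_utf8(shortest))
print(overlong.hex(" "), is_valid_utf8(overlong))
```

The decoder accepts the three-byte form and rejects the four-byte one, which matches the quoted Wikipedia sentence.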
Other answers
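In UTF-16, a character needs four bytes (two code units) if its first code unit lies in the range 0xD800 to 0xDBFF; such a pair of code units is called a "surrogate pair". More specifically, a surrogate pair has the form:
[0xD800 - 0xDBFF] [0xDC00 - 0xDFFF]
where […] denotes a two-byte code unit in the given range. Any value <= 0xD7FF is a single code unit (two bytes). Values >= 0xE000 are likewise single code units (two bytes); only the range 0xD800 to 0xDFFF is reserved for surrogates.
See http://unicodebook.readthedocs.io/unicode_encodings.html, section 7.5.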
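The surrogate-pair form can be sketched as a small Python function that splits a code point into UTF-16 code units. The function name `utf16_code_units` is made up for this illustration; the arithmetic is the standard UTF-16 scheme of subtracting 0x10000 and spreading the remaining 20 bits over two units.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units (one or two) for a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        return [code_point]                  # one code unit, two bytes
    offset = code_point - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits: 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)          # low 10 bits: 0xDC00..0xDFFF
    return [high, low]                       # two code units, four bytes


for cp in (0x0041, 0x20AC, 0x1F600):
    units = utf16_code_units(cp)
    print(f"U+{cp:04X}", [f"0x{u:04X}" for u in units], len(units) * 2, "bytes")
```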
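Simply put, Unicode is a standard that assigns a number (called a code point) to every character in the world (it is still a work in progress).
Now you need to represent these code points using bytes; that is called character encoding. UTF-8, UTF-16 and UTF-32 are ways of representing these characters.
UTF-8 is a multi-byte character encoding. A character takes 1 to 4 bytes (the original design allowed sequences of up to 6 bytes, but the longer forms are no longer used).
UTF-32 uses 4 bytes for every character.
UTF-16 uses 16-bit code units: characters in the Basic Multilingual Plane (BMP), which is enough for most practical purposes, take one code unit, and all other characters take two. Java uses this encoding in its strings.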
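To see how the same code point takes a different number of bytes in each encoding, here is a small Python sketch. The sample characters and the helper name `byte_sizes` are chosen only for illustration; the `-le` codecs are used so that no byte order mark is counted.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

# Sample characters: ASCII letter, Latin-1 letter, BMP symbol, non-BMP emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def byte_sizes(ch: str) -> dict[str, int]:
    """Map each encoding name to the number of bytes `ch` needs in it."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for ch in samples:
    print(f"U+{ord(ch):04X}", byte_sizes(ch))
```

Only the last sample lies outside the BMP, so it is the only one that needs four bytes in UTF-16.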
Strange that nobody pointed out how to calculate how many bytes one Unicode character takes. Here is the rule for UTF-8 encoded strings:
Binary    Hex         Comments
0xxxxxxx  0x00..0x7F  Only byte of a 1-byte character encoding
10xxxxxx  0x80..0xBF  Continuation byte: one of 1-3 bytes following the first
110xxxxx  0xC0..0xDF  First byte of a 2-byte character encoding
1110xxxx  0xE0..0xEF  First byte of a 3-byte character encoding
11110xxx  0xF0..0xF7  First byte of a 4-byte character encoding
So the quick answer is: it takes 1 to 4 bytes, depending on the first byte, which indicates how many bytes the character takes up.
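The table translates almost directly into code. The sketch below is one possible reading of it, with the hypothetical function name `utf8_sequence_length`. It only classifies the first byte as the table does; it does not check that the following bytes are valid continuation bytes, and it does not reject the few lead bytes (such as 0xC0 or 0xF5) that never occur in well-formed UTF-8.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence starting with `first_byte` has."""
    if 0x00 <= first_byte <= 0x7F:
        return 1                      # 0xxxxxxx
    if 0xC0 <= first_byte <= 0xDF:
        return 2                      # 110xxxxx
    if 0xE0 <= first_byte <= 0xEF:
        return 3                      # 1110xxxx
    if 0xF0 <= first_byte <= 0xF7:
        return 4                      # 11110xxx
    # 10xxxxxx is a continuation byte; 0xF8 and above are not in the table.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")


for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", encoded.hex(" "), utf8_sequence_length(encoded[0]))
```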
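You won't see a simple answer because there isn't one.
First, Unicode doesn't contain "every character from every language", although it sure does try.
Unicode itself is a mapping: it defines code points, and a code point is a number that is usually associated with a character. I say usually because there are concepts like combining characters. You may be familiar with things like accents or umlauts. They can be used with another character, such as a or u, to create a new logical character. A character can therefore consist of one or more code points.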
To be useful in computing systems we need to choose a representation for this information. Those are the various Unicode encodings, such as UTF-8, UTF-16LE, UTF-32 etc. They are distinguished largely by the size of their code units. UTF-32 is the simplest encoding: its code unit is 32 bits, which means an individual code point fits comfortably into one code unit. In the other encodings a code point may need multiple code units, or a particular code point can't be represented in the encoding at all (this is a problem, for instance, with UCS-2).
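Because of the flexibility of combining characters, even within a given encoding the number of bytes per character can vary depending on the character and on the normalization form. Normalization is a protocol for dealing with characters that have more than one representation (you can say "an 'a' with an accent", which is two code points, one of which is a combining character, or "accented 'a'", which is one code point).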
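The effect of the normalization form on the byte count can be sketched with Python's standard `unicodedata` module. The helper name `utf8_size_by_form` is made up for this example, and À stands in for any accented letter that has both a precomposed and a decomposed representation.

```python
import unicodedata

composed = "\u00c0"      # À as a single precomposed code point
decomposed = "A\u0300"   # A followed by COMBINING GRAVE ACCENT


def utf8_size_by_form(text: str) -> dict[str, int]:
    """Return the UTF-8 byte count of `text` under NFC and NFD normalization."""
    return {
        form: len(unicodedata.normalize(form, text).encode("utf-8"))
        for form in ("NFC", "NFD")
    }


# Both spellings display as the same logical character but differ in size.
print(len(composed), len(decomposed))   # code points in each spelling
print(utf8_size_by_form(composed))
print(utf8_size_by_form(decomposed))
```

Whichever spelling you start from, the NFC form is the shorter one here, so "how many bytes is this character" depends on the normalization form as well as on the encoding.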