Old character encodings such as ASCII date from the (pre-)8-bit era and try to cram the dominant language in computing at the time, English, into numbers ranging from 0 to 127 (7 bits). With 26 letters in the alphabet, in both upper and lower case, plus digits and punctuation marks, that worked pretty well. ASCII was later extended by an 8th bit for other, non-English languages, but the additional 128 numbers/code points made available by this expansion were mapped to different characters depending on the language being displayed. The ISO-8859 standards are the most common forms of this mapping, notably ISO-8859-1 (also known as ISO-Latin-1 or latin1) and ISO-8859-15; and yes, those are two different versions of the ISO 8859 standard.
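To make the "same number, different character" problem concrete, here is a minimal Python sketch. It assumes the codec names `latin-1`, `iso8859-15` and `iso8859-5` that ship with Python; the variable names are only illustrative.

```python
# One byte value can mean different characters in different ISO-8859 parts.
currency_byte = bytes([0xA4])
as_8859_1 = currency_byte.decode("latin-1")       # generic currency sign
as_8859_15 = currency_byte.decode("iso8859-15")   # euro sign
print(as_8859_1, as_8859_15)

letter_byte = bytes([0xE4])
as_western = letter_byte.decode("latin-1")        # Latin small a with diaeresis
as_cyrillic = letter_byte.decode("iso8859-5")     # Cyrillic small ef
print(as_western, as_cyrillic)

# Bytes in the 0-127 range are plain ASCII and decode the same way in all of them.
ascii_bytes = b"Hello"
decoded = {ascii_bytes.decode(codec)
           for codec in ("ascii", "latin-1", "iso8859-15", "iso8859-5")}
print(decoded)
```

Only the upper 128 values are ambiguous, which is why text that stays within ASCII survives being opened with the wrong ISO-8859 variant.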
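There are essentially two different types of encodings. One type expands the value range by adding more bits. Examples of these encodings are UCS2 (2 bytes = 16 bits) and UCS4 (4 bytes = 32 bits). They suffer from inherently the same problem as the ASCII and ISO-8859 standards, because their value range is still limited, even if the limit is much higher.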
The other type of encoding uses a variable number of bytes per character, and the most commonly known encodings of this kind are the UTF encodings. The UTF encodings work in roughly the same manner: you choose a unit size, which for UTF-8 is 8 bits, for UTF-16 is 16 bits, and for UTF-32 is 32 bits. For UTF-8 and UTF-16 the standard then reserves a few bits of each unit as flags: if they are set, the following unit in the sequence is to be considered part of the same character; if they are not set, the unit represents one character fully. (UTF-32 needs no such flags, because a single 32-bit unit can hold any code point.) Thus the most common (English) characters occupy only one byte in UTF-8 (two in UTF-16, four in UTF-32), while characters of other languages can occupy up to four bytes in UTF-8.
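1 - The Unicode character table
Of these 1,114,112 code points, 1,111,998 can store Unicode characters, because 2,048 code points are reserved as surrogates and 66 code points are reserved as noncharacters. So there are 1,111,998 code points available for unique characters, symbols, emoji and so on.
However, so far only 144,697 of those 1,114,112 code points have been used. These 144,697 code points contain the characters covering all languages, as well as symbols, emoji and so on.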
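The arithmetic behind these numbers can be checked with a few lines of Python. This is only a sketch; the constant names are made up for illustration, and the ranges used are the surrogate block U+D800 to U+DFFF and the 66 noncharacters (U+FDD0 to U+FDEF plus the last two code points of each of the 17 planes).

```python
MAX_CODE_POINT = 0x10FFFF
total = MAX_CODE_POINT + 1                   # code points 0 .. 0x10FFFF

surrogates = 0xDFFF - 0xD800 + 1             # reserved for UTF-16 surrogate pairs
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * 17

usable = total - surrogates - noncharacters
print(f"{total:,} total, {surrogates:,} surrogates, "
      f"{noncharacters} noncharacters, {usable:,} usable")
```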
Each character in Unicode is assigned to a specific code point, that is, it has a specific value (Unicode number). For example, the character "❤" has the Unicode number "U+2764". That value takes up exactly one code point out of the 1,114,112 code points. Stored as UTF-8, "U+2764" looks like this in binary: "11100010 10011101 10100100", which is exactly 3 bytes or 24 bits (the two spaces are not part of the data; they are there only to make the 24 bits more readable, so please ignore them).
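Now, how is our computer supposed to know whether these 3 bytes "11100010 10011101 10100100" are to be read separately or together? If the 3 bytes were read separately and each converted to a character, the result would be "Ô, Ø, ñ", which is quite different from our heart emoji "❤".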
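The difference between reading the three bytes together and reading them one at a time can be reproduced in Python. The text does not say which single-byte character set produced "Ô, Ø, ñ"; the sketch below assumes code page 850 (`cp850`), which is one legacy code page that yields exactly those characters.

```python
heart_bytes = bytes([0b11100010, 0b10011101, 0b10100100])

# Read together, as one UTF-8 sequence.
together = heart_bytes.decode("utf-8")

# Read separately, one byte per character (assumption: code page 850).
separately = [bytes([b]).decode("cp850") for b in heart_bytes]

print(together)      # the heart character
print(separately)    # three unrelated characters
```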
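2 - Encoding standards (UTF-8, ISO-8859, Windows-1251, etc.)
To solve this problem, encoding standards were invented. Since 2008 the most popular one has been UTF-8. UTF-8 is used by 97.6% of all web pages on average, which is why we will use UTF-8 for the examples below.
2.1 - What is encoding?
Encoding, simply put, means converting something from one thing into another. In our case we are converting data, more precisely bytes, to the UTF-8 format. I would also like to rephrase that sentence as "converting bytes to UTF-8 bytes", although that may not be technically correct.
UTF-8 uses a minimum of 1 byte and a maximum of 4 bytes to store a character. Thanks to the UTF-8 format, we can have characters that carry more than 1 byte of information.
The Chinese character "汉" needs 16 binary bits, "01101100 01001001". As discussed above, we cannot read this character unless we encode it as UTF-8, because the computer would have no way of knowing whether these two bytes are to be read separately or together.
(normal bytes) "01101100 01001001" -> (UTF-8 encoded bytes) "11100110 10110001 10001001"
2.4 - How does UTF-8 encoding work?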
Binary format of bytes in sequence:
1st Byte   2nd Byte   3rd Byte   4th Byte   Number of Free Bits   Maximum Expressible Unicode Value
0xxxxxxx                                    7                     007F hex (127)
110xxxxx   10xxxxxx                         (5+6)=11              07FF hex (2047)
1110xxxx   10xxxxxx   10xxxxxx              (4+6+6)=16            FFFF hex (65535)
11110xxx   10xxxxxx   10xxxxxx   10xxxxxx   (3+6+6+6)=21          10FFFF hex (1,114,111)
The "x" characters in the table above represent the "free bits": those bits are empty and we can write to them. The other bits are reserved by the UTF-8 format and serve as headers/markers. Thanks to these headers, when bytes are read using the UTF-8 encoding, the computer knows which bytes to read together and which separately. The size of your character in bytes, after being encoded in the UTF-8 format, depends on how many bits you need to write. In our case the "汉" character is exactly 2 bytes or 16 bits: "01101100 01001001". Its size after being encoded to UTF-8 will therefore be 3 bytes or 24 bits, "11100110 10110001 10001001", because three UTF-8 bytes have 16 free bits that we can write to. The solution, step by step:
Header   Placeholder   Fill in our binary   Result
1110     xxxx          0110                 11100110
10       xxxxxx        110001               10110001
10       xxxxxx        001001               10001001
A Chinese character: 汉
its Unicode value: U+6C49
convert 6C49 to binary: 01101100 01001001
encode 6C49 as UTF-8: 11100110 10110001 10001001
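The header-and-free-bits scheme can be written out as a small encoder. The sketch below is not how any particular library implements UTF-8; `utf8_encode` is a hypothetical helper that simply follows the four rows of the table, and its result is compared against Python's built-in `str.encode`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the free bits of the matching table row."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode value")
    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    return bytes([0b11110000 | (code_point >> 18),   # 11110xxx + 3 continuation bytes
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


encoded = utf8_encode(0x6C49)
bits = " ".join(f"{byte:08b}" for byte in encoded)
print(bits)
print(encoded == "\u6c49".encode("utf-8"))
```

Shifting right by 6, 12 or 18 bits selects which group of free bits goes into each byte, and the `|` with the header constant stamps the marker bits on top.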
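3 - Differences between UTF-8, UTF-16 and UTF-32
Original explanation of the differences between the UTF-8, UTF-16 and UTF-32 encodings: https://javarevisited.blogspot.com/2015/02/difference-between-utf-8-utf-16-and-utf.html
UTF-8 uses at least 1 byte, but can use 2, 3 or 4 bytes if the character is larger. UTF-8 is also compatible with the ASCII table.
UTF-16 uses at least 2 bytes. UTF-16 cannot take 3 bytes; it takes either 2 or 4 bytes. UTF-16 is not compatible with the ASCII table.
Remember: UTF-8 and UTF-16 are variable-length encodings, where UTF-8 can take 1 to 4 bytes while UTF-16 can take 2 or 4 bytes. UTF-32 is a fixed-width encoding that always uses 32 bits.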
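These size rules are easy to observe directly. The following sketch measures a few sample characters (chosen here purely for illustration) in all three encodings. It uses Python's `utf-16-be` and `utf-32-be` codecs rather than plain `utf-16` / `utf-32`, because the plain variants prepend a byte order mark that would inflate the counts.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")

def byte_lengths(ch: str) -> tuple:
    """Number of bytes one character occupies in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(enc)) for enc in ENCODINGS)

samples = ["A", "\u00e9", "\u6c49", "\U0001F600"]   # A, e-acute, 汉, an emoji
for ch in samples:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

UTF-32 always reports 4, UTF-16 reports 2 or 4, and UTF-8 covers the whole range from 1 to 4.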
This article explains all the details: http://kunststube.net/encoding/
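あ as UTF-8 (E3 81 82), left-padded with zeros to 32 bits: 00000000 11100011 10000001 10000010
あ as UTF-16 (30 42), left-padded with zeros to 32 bits: 00000000 00000000 00110000 01000010
Example: if you decode the UTF-8 bytes 11100011 10000001 10000010 as UTF-16, you will get 臣 instead of あ. (Read as a little-endian 16-bit unit, the first two bytes E3 81 form U+81E3, which is 臣; the third byte is left over.)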
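Note: an encoding and Unicode are two different things. Unicode is a big table in which every symbol is mapped to a unique code point. For example, the symbol (letter) あ has the code point 30 42 (hex). An encoding, on the other hand, is an algorithm that converts symbols into a form more suitable for storage on hardware.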
30 42 (hex) -> UTF-8 encoding -> E3 81 82 (hex), which is the result shown above in binary.
30 42 (hex) -> UTF-16 encoding -> 30 42 (hex), which is the result shown above in binary.
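Both conversions, and what goes wrong when bytes are decoded with the wrong encoding, can be reproduced in Python. This is a sketch with illustrative variable names. It assumes big-endian byte order for the UTF-16 and UTF-32 output (so the bytes read 30 42) and little-endian order for the deliberately wrong decode, since that is the byte order under which E3 81 comes out as 臣.

```python
hiragana_a = "\u3042"                        # あ

utf8 = hiragana_a.encode("utf-8")
utf16 = hiragana_a.encode("utf-16-be")
utf32 = hiragana_a.encode("utf-32-be")
print(utf8.hex(), utf16.hex(), utf32.hex())

# Same bytes, wrong decoder: the first two UTF-8 bytes read as one UTF-16 unit.
wrong = utf8[:2].decode("utf-16-le")
print(wrong, f"U+{ord(wrong):04X}")
```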
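Unicode is the standard that maps characters to code points. Every character has a unique code point (identification number), which is a number like 9731.
UTF-8 is an encoding of code points. To store all characters on disk (in a file), UTF-8 splits characters into up to 4 octets (8-bit sequences), that is, bytes. UTF-8 is one of several encodings (ways of representing data). For example, in Unicode the (decimal) code point 9731 represents a snowman (☃), which in UTF-8 consists of 3 bytes: E2 98 83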
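The split between "code point" (Unicode) and "bytes" (UTF-8) maps onto two separate steps in Python: `chr` / `ord` convert between a number and a character, while `encode` / `decode` convert between a character and bytes. A minimal sketch for the snowman, with illustrative variable names:

```python
code_point = 9731                       # decimal code point from the text

snowman = chr(code_point)               # Unicode: number -> character
label = f"U+{code_point:04X}"           # the same number in the usual U+ notation

encoded = snowman.encode("utf-8")       # UTF-8: character -> bytes
round_trip = ord(encoded.decode("utf-8"))

print(snowman, label, encoded.hex().upper(), round_trip)
```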
Backward compatibility: ASCII characters are encoded exactly as they are in ASCII, so an ASCII string is also a valid UTF-8 string representing the same characters.
More efficient: Text strings in UTF-8 usually occupy less space than the same strings in UTF-32 or UTF-16. The main exception is text dominated by code points between U+0800 and U+FFFF (most CJK text, for example), where UTF-8 needs 3 bytes per character and UTF-16 only 2.
Binary sorting: Sorting UTF-8 strings using a binary sort will still result in all code points being sorted in numerical order.
No ASCII look-alikes: When a code point uses multiple bytes, none of those bytes contain values in the ASCII range, ensuring that no part of them can be mistaken for an ASCII character. This is also a security feature.
Easy validation: UTF-8 can be easily validated and distinguished from other character encodings by a validator. Text in other 8-bit or multi-byte encodings will very rarely also validate as UTF-8, because of the very specific structure of UTF-8.
Random access: At any point in a UTF-8 string it is possible to tell whether the byte at that position is the first byte of a character or not, and to find the start of the next or current character, without needing to scan forwards or backwards more than 3 bytes or to know how far into the string we started reading from.
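Three of these properties (random access, no ASCII bytes inside multi-byte sequences, and easy validation) follow directly from the byte headers and can be sketched in a few lines. `char_start` and `is_valid_utf8` are hypothetical helper names, and the validator simply leans on Python's own strict UTF-8 decoder rather than re-implementing the checks.

```python
def char_start(data: bytes, index: int) -> int:
    """Index of the first byte of the character that covers data[index]."""
    # Continuation bytes always look like 10xxxxxx, so step back past them.
    while index > 0 and (data[index] & 0b11000000) == 0b10000000:
        index -= 1
    return index

def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes form well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "a\u6c49\U0001F600".encode("utf-8")      # 1 + 3 + 4 bytes
starts = [char_start(text, i) for i in range(len(text))]
print(starts)

# Every byte of a multi-byte sequence has its top bit set, so it is never ASCII.
multi_byte_only = "\u6c49\U0001F600".encode("utf-8")
print(all(byte >= 0x80 for byte in multi_byte_only))

# The same word in another 8-bit encoding does not validate as UTF-8.
print(is_valid_utf8("caf\u00e9".encode("utf-8")),
      is_valid_utf8("caf\u00e9".encode("latin-1")))
```

In this sketch the loop in `char_start` steps back at most three times on well-formed input, which matches the "no more than 3 bytes" bound stated above.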