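What are Unicode, UTF-8, and UTF-16?

What is the basis for Unicode, and why is there a need for UTF-8 or UTF-16? I have researched this on Google and searched here as well, but it is still not clear to me.

In VSS, when doing a file comparison, there is sometimes a message saying that the two files have different UTFs. Why would this be the case?

Please explain in simple terms.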

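Current answer

Unicode is a standard that maps the characters of all languages to particular numeric values called code points. It does this so that different encodings can be built on the same set of code points.

UTF-8 and UTF-16 are two such encodings. They take code points as input and encode them using well-defined formulas to produce the encoded string.

Which encoding to choose depends on your requirements. Different encodings have different memory requirements, so depending on the characters you will be handling, you should choose the encoding that needs the fewest bytes to encode them.

For more in-depth details about Unicode, UTF-8, and UTF-16, you can check out this article:

What every programmer should know about Unicode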

2017-03-25 15:10:57

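Other answers

Originally, Unicode was intended to have a fixed-width 16-bit encoding (UCS-2). Early adopters of Unicode, like Java and Windows NT, built their libraries around 16-bit strings.

Later, the scope of Unicode was expanded to include historical characters, which would require more than the 65,536 code points a 16-bit encoding can support. To allow the additional characters to be represented on platforms that had used UCS-2, the UTF-16 encoding was introduced. It uses "surrogate pairs" to represent characters in the supplementary planes.

Meanwhile, a lot of older software and network protocols were using 8-bit strings. UTF-8 was made so that these systems could support Unicode without having to use wide characters. It is backwards compatible with 7-bit ASCII.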

2010-07-05 05:04:27

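Why do we need Unicode?

In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ in the same document (I hope I didn't break any old browsers).

But for argument's sake, let's say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but it is not fine for Joe the software developer. Approximately half the world uses non-Latin characters, and using ASCII is arguably inconsiderate to these people. On top of that, he is closing off his software to a large and growing economy.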

Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence the first 128 are identical to ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.

Memory considerations

So how many bytes give access to which characters in these encodings?

UTF-8:

1 byte: Standard ASCII
2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
3 bytes: BMP
4 bytes: All Unicode characters

UTF-16:

2 bytes: BMP
4 bytes: All Unicode characters

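It's worth mentioning that characters not in the BMP include ancient scripts, mathematical symbols, musical symbols, and rarer Chinese, Japanese, and Korean (CJK) characters.

If you'll be working mostly with ASCII characters, then UTF-8 is certainly more memory efficient. However, if you're working mostly with non-European scripts, UTF-8 can take up to 1.5 times as much memory as UTF-16 (three bytes per BMP character instead of two). When dealing with large amounts of text, such as large web pages or lengthy Word documents, this could impact performance.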
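To make the byte counts above concrete, here is a small Python sketch that measures a few sample strings in both encodings. The sample strings and the helper name encoded_sizes are only illustrative, and the UTF-16 size is measured without a byte order mark.

```python
# Illustrative samples; the helper name `encoded_sizes` is not from any library.
samples = {
    "ASCII": "hello",
    "Chinese (BMP)": "\u6c49\u8bed",       # the two characters of "汉语"
    "Emoji (non-BMP)": "\U0001F600",
}

def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for label, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label}: {utf8_len} bytes in UTF-8, {utf16_len} bytes in UTF-16")
```

For the two Chinese characters, UTF-8 needs six bytes against four for UTF-16, which is where the factor of 1.5 comes from.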

Encoding basics

Note: If you know how UTF-8 and UTF-16 are encoded, skip to the next section for practical applications.

UTF-8: For the standard ASCII (0-127) characters, the UTF-8 codes are identical. This makes UTF-8 ideal if backwards compatibility is required with existing ASCII text. Other characters require anywhere from 2-4 bytes. This is done by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character. In particular, the first bit of each byte is 1 to avoid clashing with the ASCII characters.

UTF-16: For valid BMP characters, the UTF-16 representation is simply its code point. However, for non-BMP characters UTF-16 introduces surrogate pairs. In this case a combination of two two-byte portions map to a non-BMP character. These two-byte portions come from the BMP numeric range, but are guaranteed by the Unicode standard to be invalid as BMP characters. In addition, since UTF-16 has two bytes as its basic unit, it is affected by endianness. To compensate, a reserved byte order mark can be placed at the beginning of a data stream which indicates endianness. Thus, if you are reading UTF-16 input, and no endianness is specified, you must check for this.
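The bit layout described above can be sketched by hand in Python. This is only an illustration of the two schemes: the function names utf8_encode and utf16_code_units are made up for this example, and the sketch skips the validation a real encoder needs (it does not reject surrogate code points or values above U+10FFFF). In practice you would call str.encode instead.

```python
def utf8_encode(code_point):
    """Encode one code point into UTF-8 bytes by hand (no validation)."""
    if code_point < 0x80:            # 1 byte:  0xxxxxxx (plain ASCII)
        return bytes([code_point])
    if code_point < 0x800:           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

def utf16_code_units(code_point):
    """Return the UTF-16 code units for one code point (no validation)."""
    if code_point < 0x10000:         # BMP: the code unit is the code point
        return [code_point]
    offset = code_point - 0x10000    # 20 bits, split into two 10-bit halves
    high = 0xD800 | (offset >> 10)   # high (lead) surrogate
    low = 0xDC00 | (offset & 0x3FF)  # low (trail) surrogate
    return [high, low]

print(utf8_encode(0x20AC).hex(" "))
print([hex(unit) for unit in utf16_code_units(0x1F600)])
```

Every byte of a multi-byte UTF-8 sequence has its top bit set, so none of them can be mistaken for an ASCII character, and both surrogates fall in the reserved range U+D800 to U+DFFF.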

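As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using! For further details on these encodings, please see the UTF FAQ.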
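As a rough illustration of that incompatibility, the Python sketch below checks a byte string for a byte order mark and shows that UTF-16 bytes do not decode as UTF-8. The helper name sniff_utf_bom is hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

def sniff_utf_bom(data):
    """Return a codec name suggested by a leading BOM, or None if there is none.

    Simplification: UTF-32 is ignored, and data without a BOM is not guessed at.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None

utf16_bytes = "\u03c0".encode("utf-16-le")   # the letter pi as UTF-16LE bytes
try:
    utf16_bytes.decode("utf-8")
except UnicodeDecodeError as error:
    print("not valid UTF-8:", error)

print(sniff_utf_bom("hi".encode("utf-16")))  # Python's "utf-16" codec writes a BOM
```

A comparison tool that reports two files as having "different UTFs" is presumably making a similar check on each file before comparing their contents.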

Practical programming considerations

Character and string data types: How are they encoded in the programming language? If they are raw bytes, the minute you try to output non-ASCII characters, you may run into a few problems. Also, even if the character type is based on a UTF, that doesn't mean the strings are proper UTF. They may allow byte sequences that are illegal. Generally, you'll have to use a library that supports UTF, such as ICU for C, C++ and Java. In any case, if you want to input/output something other than the default encoding, you will have to convert it first.
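A minimal Python sketch of the "convert it first" step, assuming you already know the source encoding. The helper names transcode and is_valid are invented for this example. Decoding is strict, so illegal byte sequences are reported rather than passed through.

```python
def transcode(data, source_encoding, target_encoding):
    """Decode bytes from one encoding and re-encode them in another.

    Strict decoding raises UnicodeDecodeError on illegal byte sequences.
    """
    text = data.decode(source_encoding, errors="strict")
    return text.encode(target_encoding)

def is_valid(data, encoding):
    """Return True if data decodes cleanly in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

utf16_input = "\u03c0".encode("utf-16-le")
print(transcode(utf16_input, "utf-16-le", "utf-8"))
print(is_valid(b"\xff\xfe\xfd", "utf-8"))    # 0xFF can never appear in UTF-8
print(is_valid(b"\x00\xd8", "utf-16-le"))    # a lone high surrogate
```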

Recommended, default, and dominant encodings: When given a choice of which UTF to use, it is usually best to follow recommended standards for the environment you are working in. For example, UTF-8 is dominant on the web, and since HTML5, it has been the recommended encoding. Conversely, both .NET and Java environments are founded on a UTF-16 character type. Confusingly (and incorrectly), references are often made to the "Unicode encoding", which usually refers to the dominant UTF encoding in a given environment.

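Library support: The libraries you are using support some kind of encoding. Which one? Do they support the corner cases? Since necessity is the mother of invention, UTF-8 libraries will generally support 4-byte characters properly, since 1, 2, and even 3 byte characters occur frequently. However, not all purported UTF-16 libraries support surrogate pairs properly, since they occur very rarely.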

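Counting characters: There exist combining characters in Unicode. For example, the code points U+006E (n) and U+0303 (a combining tilde) together form ñ, while the single code point U+00F1 forms ñ. They should look identical, but a simple counting algorithm will return 2 for the first example and 1 for the second. This isn't necessarily wrong, but it may not be the desired outcome either.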
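The counting difference can be reproduced with Python's standard unicodedata module. The helper name count_code_points is made up for this sketch. Note that normalizing to NFC only helps where a precomposed character exists, so this is not a general way to count user-perceived characters.

```python
import unicodedata

decomposed = "n\u0303"    # U+006E followed by U+0303 (combining tilde)
precomposed = "\u00f1"    # U+00F1, the single-code-point form

def count_code_points(text, form=None):
    """Count code points, optionally after normalizing to the given form."""
    if form is not None:
        text = unicodedata.normalize(form, text)
    return len(text)

print(count_code_points(decomposed), count_code_points(precomposed))
print(count_code_points(decomposed, "NFC"), count_code_points(precomposed, "NFD"))
```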

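Comparing for equality: A, А, and Α look the same, but they are Latin, Cyrillic, and Greek respectively. You also have cases like C and Ⅽ. One is a letter, and the other is a Roman numeral. In addition, we have the combining characters to consider as well. For more information, see Duplicate characters in Unicode.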
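A short Python sketch of these comparison pitfalls, again using the standard unicodedata module. The helper names describe and equal_after_normalization are illustrative. Normalization resolves the combining-character case, and compatibility normalization (NFKC) folds the Roman numeral into the letter, but look-alikes from different scripts stay distinct under every normalization form.

```python
import unicodedata

def describe(text):
    """Return the Unicode character name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]

def equal_after_normalization(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

latin_a, cyrillic_a, greek_alpha = "A", "\u0410", "\u0391"
print(describe(latin_a + cyrillic_a + greek_alpha))

print(equal_after_normalization("n\u0303", "\u00f1"))        # combining vs precomposed
print(equal_after_normalization("C", "\u216d"))              # letter vs Roman numeral, NFC
print(equal_after_normalization("C", "\u216d", "NFKC"))      # same pair, NFKC
print(equal_after_normalization(latin_a, cyrillic_a, "NFKC"))
```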

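Surrogate pairs: These come up often enough on Stack Overflow, so I'll just provide some example links:

Getting string length
Removing surrogate pairs
Palindrome checking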
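The string-length issue can be illustrated in Python, whose own strings are indexed by code point. The helper below, with the made-up name utf16_length, counts UTF-16 code units instead, which is how a UTF-16-based string type such as Java's String measures length.

```python
def utf16_length(text):
    """Number of 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def has_non_bmp(text):
    """True if any character lies outside the BMP and so needs a surrogate pair."""
    return any(ord(ch) > 0xFFFF for ch in text)

sample = "a\U0001F600"    # one ASCII letter plus one non-BMP emoji
print(len(sample), utf16_length(sample), has_non_bmp(sample))
```

Code that reverses or slices a string one 16-bit unit at a time can split such a pair in half, which is the usual cause of the bugs discussed in those links.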

2013-02-28 05:24:15

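Unicode is a fairly complex standard. Don't be too afraid, but be prepared for some work! [2]

Because a credible resource is always needed, but the official report is massive, I suggest reading the following:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!): an introduction by Joel Spolsky, Stack Exchange CEO.
To the BMP and beyond!: a tutorial by Eric Muller, then Technical Director and later Vice President at the Unicode Consortium (the first 20 slides and you are done).

A brief explanation:

Computers read bytes and people read characters, so we use encoding standards to map characters to bytes. ASCII was the first widely used standard, but it covers only Latin (seven bits per character can represent 128 different characters). Unicode is a standard whose goal is to cover all possible characters in the world (it can hold up to 1,114,112 characters, meaning 21 bits per character at most. The current Unicode 8.0 specifies 120,737 characters in total, and that's all).

The main difference is that an ASCII character fits in a byte (eight bits), but most Unicode characters do not. So encoding forms/schemes (like UTF-8 and UTF-16) are used, and the character model goes like this: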

Every character holds an enumerated position from 0 to 1,114,111 (hex: 0-10FFFF) called a code point.

An encoding form maps a code point to a code unit sequence. A code unit is the way you want characters to be organized in memory: 8-bit units, 16-bit units and so on. UTF-8 uses one to four units of eight bits, and UTF-16 uses one or two units of 16 bits, to cover the entire Unicode range of 21 bits maximum. Units use prefixes so that character boundaries can be spotted, and more units mean more prefixes that occupy bits. So, although UTF-8 uses one byte for the Latin script, it needs three bytes for later scripts inside the Basic Multilingual Plane, while UTF-16 uses two bytes for all these. And that's their main difference.

Lastly, an encoding scheme (like UTF-16BE or UTF-16LE) maps (serializes) a code unit sequence to a byte sequence.

character: π
code point: U+03C0
encoding forms (code units):
    UTF-8: CF 80
    UTF-16: 03C0
encoding schemes (bytes):
    UTF-8: CF 80
    UTF-16BE: 03 C0
    UTF-16LE: C0 03

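Tip: a hexadecimal digit represents four bits, so a two-digit hexadecimal number represents a byte. Also take a look at the plane maps on Wikipedia to get a feeling for the character set layout.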
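The π example can be reproduced with Python's built-in codecs. This is just a quick check of the values listed above, and the helper names hex_bytes and encoding_schemes are invented for the sketch.

```python
def hex_bytes(data):
    """Format bytes as space-separated two-digit hexadecimal values."""
    return " ".join(f"{byte:02x}" for byte in data)

def encoding_schemes(ch):
    """Serialize one character under three encoding schemes."""
    names = ("utf-8", "utf-16-be", "utf-16-le")
    return {name: hex_bytes(ch.encode(name)) for name in names}

pi = "\u03c0"
print(f"code point: U+{ord(pi):04X}")
for name, serialized in encoding_schemes(pi).items():
    print(f"{name}: {serialized}")
```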

2015-10-27 01:03:19

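UTF stands for Unicode Transformation Format. Basically, in today's world there are scripts written in hundreds of other languages, in formats not covered by the basic ASCII used earlier. Hence, UTF came into existence.

UTF-8 is a character encoding whose code unit is eight bits, while the code unit of UTF-16 is 16 bits.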

2016-08-30 09:39:34
