What is the difference between Unicode and UTF-8?

As Rasmus puts it in his article "The difference between UTF-8 and Unicode?":

If asked the question, "What is the difference between UTF-8 and Unicode?", would you confidently reply with a short and precise answer? In these days of internationalization all developers should be able to do that. I suspect many of us do not understand these concepts as well as we should. If you feel you belong to this group, you should read this ultra short introduction to character sets and encodings.

Actually, comparing UTF-8 and Unicode is like comparing apples and oranges:

UTF-8 is an encoding - Unicode is a character set

A character set is a list of characters with unique numbers (these numbers are sometimes referred to as "code points"). For example, in the Unicode character set, the number for A is 41 [hexadecimal, written U+0041; that is 65 in decimal].

An encoding on the other hand, is an algorithm that translates a list of numbers to binary so it can be stored on disk. For example UTF-8 would translate the number sequence 1, 2, 3, 4 like this:

00000001 00000010 00000011 00000100

Our data is now translated into binary and can now be saved to disk.

All together now

Say an application reads the following from the disk:

1101000 1100101 1101100 1101100 1101111

The app knows this data represents a Unicode string encoded with UTF-8 and must show this as text to the user. The first step is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:

104 101 108 108 111

Since the app knows this is a Unicode string, it can assume each number represents a character. We use the Unicode character set to translate each number to a corresponding character. The resulting string is "hello".

Conclusion

So when somebody asks you "What is the difference between UTF-8 and Unicode?", you can now confidently answer short and precise:

UTF-8 (Unicode Transformation Format) and Unicode cannot be compared. UTF-8 is an encoding used to translate numbers into binary data. Unicode is a character set used to translate characters into numbers.
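The two steps described in the quote (character to number, then number to bytes) can be tried out directly. The following is a minimal Python sketch, not part of the quoted article; the variable names are made up for illustration. It also includes one non-ASCII character, because for ASCII text the UTF-8 bytes happen to equal the code points, which hides the fact that the encoding step does real work.

```python
# Step 1 (character set): every character has a number, its code point.
text = "hello"
code_points = [ord(ch) for ch in text]

# Step 2 (encoding): UTF-8 turns those numbers into bytes for storage.
utf8_bytes = text.encode("utf-8")
bits = " ".join(f"{b:08b}" for b in utf8_bytes)

# Decoding runs the steps in reverse: bytes -> code points -> characters.
decoded = utf8_bytes.decode("utf-8")

# For ASCII text each code point becomes one byte with the same value.
print(code_points)
print(bits)
print(decoded)

# "A" is 0x41 in hexadecimal, which is 65 in decimal.
print(hex(ord("A")), ord("A"))

# Outside ASCII, code point and bytes differ: U+00E9 becomes two bytes.
e_acute = "\u00e9"
e_acute_bytes = e_acute.encode("utf-8")
print(f"U+{ord(e_acute):04X} ->", e_acute_bytes.hex(" "))
```

Note that the binary groups printed here are 8 bits wide; the 7-digit groups in the quote are the same values with the leading zero dropped.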

2012-11-03 19:09:50

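UTF-16 and UTF-8 are both encodings of Unicode. They are both Unicode; one is no more Unicode than the other.

Don't let an unfortunate historical artifact from Microsoft confuse you.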
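A quick way to see that UTF-8 and UTF-16 are two encodings of the same thing is to encode one string both ways and decode it back. This is only an illustrative Python sketch; the sample string is arbitrary.

```python
text = "h\u00e9llo, \u4e16\u754c"  # contains ASCII, an accented letter, and two CJK characters

as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

# The bytes on disk differ in both content and length...
print(len(as_utf8), as_utf8.hex(" "))
print(len(as_utf16), as_utf16.hex(" "))

# ...but decoding either one yields the identical Unicode string.
same = as_utf8.decode("utf-8") == as_utf16.decode("utf-16-le") == text
print(same)
```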

2010-10-17 03:19:53

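There is a lot of misunderstanding on display here. Unicode is not an encoding, but the Unicode standard is devoted primarily to encoding anyway.

ISO 10646 is the international character set you (probably) care about. It defines a mapping between a set of named characters (e.g., "Latin small letter a" or "Greek small letter alpha") and a set of code points (a number assigned to each; for these two characters, 61 hexadecimal and 3B1 hexadecimal respectively; for Unicode code points, the standard notation is U+0061 and U+03B1).

At one time, Unicode defined its own character set, more or less as a competitor to ISO 10646. This was a 16-bit character set, but it was not UTF-16; it was known as UCS-2. It included a rather controversial technique for keeping the number of necessary characters to a minimum (Han unification: basically treating Chinese, Japanese and Korean characters that look much alike as the same character).

Since then, the Unicode Consortium has tacitly admitted that this was not going to work, and now concentrates primarily on ways to encode the ISO 10646 character set. The primary methods are UTF-8, UTF-16 and UCS-4 (also known as UTF-32). Those (except for UTF-8) also have LE (little-endian) and BE (big-endian) variants.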
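The mapping between character names and code points can be inspected with Python's standard unicodedata module. This is just a small sketch for illustration; the capital letter A is included next to the two small letters so the difference between U+0041 and U+0061 is visible.

```python
import unicodedata

# Look up the official name and code point of a few characters.
entries = []
for ch in ("A", "a", "\u03b1"):
    entries.append((f"U+{ord(ch):04X}", unicodedata.name(ch)))

for code_point, name in entries:
    print(code_point, name)

# The mapping also works in the other direction: name -> character.
alpha = unicodedata.lookup("GREEK SMALL LETTER ALPHA")
print(hex(ord(alpha)))
```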
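To make the encodings and their byte-order variants concrete, here is a rough Python sketch that encodes a single character, Greek small letter alpha (U+03B1), with each of them. The codec names are the ones Python's standard library uses; the dictionary name is made up for this example.

```python
ch = "\u03b1"  # GREEK SMALL LETTER ALPHA, U+03B1

codecs_to_try = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
encoded = {name: ch.encode(name) for name in codecs_to_try}

for name, data in encoded.items():
    # Same code point every time; only the byte layout changes.
    print(f"{name:10} {data.hex(' ')}")
```

UTF-8 has no LE/BE variants because it is defined as a sequence of single bytes, so there is no multi-byte unit whose byte order could be swapped.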

By itself, "Unicode" could refer to almost any of the above (though we can probably rule out the ones that are normally named explicitly, such as UTF-8). Unqualified use of "Unicode" probably happens most often on Windows, where it will almost certainly refer to UTF-16. Early versions of Windows NT adopted Unicode when UCS-2 was current. After UCS-2 was declared obsolete (around Win2k, if memory serves), they switched to UTF-16, which is the most similar to UCS-2 (in fact, it is identical for characters in the "Basic Multilingual Plane", which covers a lot, including all the characters for most Western European languages).
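The relationship between UCS-2 and UTF-16 can be illustrated with a short Python sketch. A character inside the Basic Multilingual Plane takes one 16-bit unit, exactly as it would in UCS-2, while a character outside it needs a surrogate pair, which UCS-2 cannot express. The helper to_surrogates is a hypothetical name written for this example; it follows the standard UTF-16 surrogate arithmetic.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

bmp_char = "\u00e9"         # U+00E9, inside the BMP
astral_char = "\U0001F600"  # U+1F600, outside the BMP

bmp_units = bmp_char.encode("utf-16-be")
astral_units = astral_char.encode("utf-16-be")

print(bmp_units.hex(" "))     # one 16-bit unit, same as UCS-2
print(astral_units.hex(" "))  # two 16-bit units (a surrogate pair)
print([hex(u) for u in to_surrogates(ord(astral_char))])
```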

2010-10-17 03:05:48

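"Most editors support saving as 'Unicode' encoding, actually."

This is an unfortunate misnaming perpetrated by Windows.

Because Windows uses UTF-16LE internally as the in-memory storage format for Unicode strings, it considers this to be the natural encoding of Unicode text. In the Windows world, there are ANSI strings (encoded in the system code page of the current machine, and therefore completely unportable) and there are Unicode strings (stored internally as UTF-16LE).

This was all devised in the early days of Unicode, before we realised that UCS-2 was not enough, and before UTF-8 was invented. This is why Windows's support for UTF-8 is poor all round.

This misguided naming scheme became part of the user interface. A text editor that uses Windows's encoding support to offer a range of encodings will automatically and inappropriately describe UTF-16LE as "Unicode", and UTF-16BE, if provided, as "Unicode big-endian".

(Other editors that do the encoding themselves, like Notepad++, don't have this problem.)

If it makes you feel any better, "ANSI" strings aren't based on any ANSI standard, either.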
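Why code-page ("ANSI") strings are unportable can be shown with a single byte. In the Python sketch below, cp1252 (Western European) and cp1251 (Cyrillic) are simply two example Windows code pages picked for illustration; which one a given machine actually uses is an assumption that depends on its locale settings.

```python
raw = b"\xe9"  # one byte, as it might be stored in an "ANSI" text file

# The same byte means different characters under different code pages.
western = raw.decode("cp1252")
cyrillic = raw.decode("cp1251")
print(f"cp1252: U+{ord(western):04X}  cp1251: U+{ord(cyrillic):04X}")

# A "Unicode" string in the Windows sense is UTF-16LE, which is unambiguous.
utf16le = western.encode("utf-16-le")
print(utf16le.hex(" "))
print(utf16le.decode("utf-16-le") == western)
```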
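What those two menu labels produce on disk can be approximated in Python. The sketch assumes the editor writes a byte order mark (BOM) in front of the text, as Windows Notepad has traditionally done; the variable names are made up for illustration.

```python
import codecs

text = "hi"

# Roughly what a file saved as "Unicode" contains: BOM + UTF-16LE.
saved_as_unicode = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
# Roughly what "Unicode big-endian" contains: BOM + UTF-16BE.
saved_as_unicode_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

print(saved_as_unicode.hex(" "))
print(saved_as_unicode_be.hex(" "))

# Python's generic "utf-16" codec reads the BOM to pick the byte order.
print(saved_as_unicode.decode("utf-16"), saved_as_unicode_be.decode("utf-16"))
```

Both files hold the same Unicode text; the labels really describe two byte orders of UTF-16, not "Unicode" as such.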

2010-10-17 02:57:30

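Unicode was developed with the aim of creating a new standard for mapping the characters of the vast majority of languages in use today, along with other characters that are less essential but may still be needed to create text. UTF-8 is only one of the many ways you can encode a file, because there are many ways to encode the Unicode characters in a file as bytes.

Source: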

http://www.differencebetween.net/technology/difference-between-unicode-and-utf-8/

2010-10-17 02:19:47
