UTF-8、UTF-16、UTF-32

简而言之:

UTF-8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes. Good for English text, not so good for Asian text. UTF-16: Variable-width encoding. Code points U+0000 to U+FFFF take 2 bytes, code points U+10000 to U+10FFFF take 4 bytes. Bad for English text, good for Asian text. UTF-32: Fixed-width encoding. All code points take four bytes. An enormous memory hog, but fast to operate on. Rarely used.

长:参见维基百科:UTF-8, UTF-16和UTF-32。

2009-01-30 17:10:09

UTF-8为变量1 ~ 4字节。 UTF-16是可变的2或4字节。 UTF-32是固定的4字节。

2009-01-30 17:10:29

在阅读完答案后，UTF-32需要一些爱。

C#:

Data1 = RandomNumberGenerator.GetBytes(500_000_000);

sw = Stopwatch.StartNew();
int l = Encoding.UTF8.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"UTF-8: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s}   Size - {l:###,###,###}");

sw = Stopwatch.StartNew();
l = Encoding.Unicode.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"Unicode: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s}   Size - {l:###,###,###}");

sw = Stopwatch.StartNew();
l = Encoding.UTF32.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"UTF-32: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s}   Size - {l:###,###,###}");

sw = Stopwatch.StartNew();
l = Encoding.ASCII.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"ASCII: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s}   Size - {l:###,###,###}");

UTF-8—经过9.939秒-大小473,752,800

Unicode—消失0.853秒-大小2.5亿

UTF-32—消失3.143秒-大小125,030,570

ASCII—经过2.362秒-大小500,000,000

Utf-32——丢麦克风

2022-02-06 04:21:53

简而言之，使用UTF-16或UTF-32的唯一原因是分别支持非英语和古代脚本。

我想知道为什么有人会选择非utf -8编码，因为它显然对web/编程更有效。

一个常见的误解-加后缀的数字不是它的能力的指示。它们都支持完整的Unicode，只是UTF-8可以用一个字节处理ASCII，所以对CPU和互联网来说更有效/更不容易损坏。

一些不错的阅读:http://www.personal.psu.edu/ejp10/blogs/gotunicode/2007/10/which_utf_do_i_use.html 和http://utf8everywhere.org

2015-05-21 15:12:32

简而言之:

UTF-8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes. Good for English text, not so good for Asian text. UTF-16: Variable-width encoding. Code points U+0000 to U+FFFF take 2 bytes, code points U+10000 to U+10FFFF take 4 bytes. Bad for English text, good for Asian text. UTF-32: Fixed-width encoding. All code points take four bytes. An enormous memory hog, but fast to operate on. Rarely used.

长:参见维基百科:UTF-8, UTF-16和UTF-32。

2009-01-30 17:10:09

utf - 8

没有字节顺序的概念每个字符使用1到4个字节 ASCII是一种兼容的编码子集完全自同步，例如从流中的任何地方删除字节最多只会损坏一个字符几乎所有的欧洲语言都是用两个字节或更少的字符编码的

utf - 16

必须使用已知的字节顺序进行解析，或者读取字节顺序标记(BOM)。每个字符使用2或4个字节

utf - 32

每个字符是4个字节必须使用已知的字节顺序进行解析，或者读取字节顺序标记(BOM)。

UTF-8将是空间效率最高的，除非大多数字符来自CJK(中国、日本和韩国)字符空间。

UTF-32最适合通过字符偏移量随机访问字节数组。

2015-03-05 20:05:10

UTF-8、UTF-16、UTF-32

推荐文章

最新文章

标签