UTF-8、UTF-16和UTF-32之间有什么区别?
我知道它们都将存储Unicode,并且每个都使用不同的字节数来表示一个字符。选择一个比另一个有优势吗?
UTF-8、UTF-16和UTF-32之间有什么区别?
我知道它们都将存储Unicode,并且每个都使用不同的字节数来表示一个字符。选择一个比另一个有优势吗?
当前回答
我做了一些测试来比较MySQL中UTF-8和UTF-16之间的数据库性能。
更新的速度
utf - 8
utf - 16
插入的速度
删除速度
其他回答
UTF-8为变量1 ~ 4字节。 UTF-16是可变的2或4字节。 UTF-32是固定的4字节。
在阅读完答案后,UTF-32需要一些爱。
C#:
Data1 = RandomNumberGenerator.GetBytes(500_000_000);
sw = Stopwatch.StartNew();
int l = Encoding.UTF8.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"UTF-8: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
sw = Stopwatch.StartNew();
l = Encoding.Unicode.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"Unicode: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
sw = Stopwatch.StartNew();
l = Encoding.UTF32.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"UTF-32: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
sw = Stopwatch.StartNew();
l = Encoding.ASCII.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"ASCII: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
UTF-8—经过9.939秒-大小473,752,800
Unicode—消失0.853秒-大小2.5亿
UTF-32—消失3.143秒-大小125,030,570
ASCII—经过2.362秒-大小500,000,000
Utf-32——丢麦克风
Unicode是一个标准,关于UTF-x,你可以把它看作是一个技术实现,用于一些实际目的:
UTF-8 - "size optimized": best suited for Latin character based data (or ASCII), it takes only 1 byte per character but the size grows accordingly symbol variety (and in worst case could grow up to 6 bytes per character) UTF-16 - "balance": it takes minimum 2 bytes per character which is enough for existing set of the mainstream languages with having fixed size on it to ease character handling (but size is still variable and can grow up to 4 bytes per character) UTF-32 - "performance": allows using of simple algorithms as result of fixed size characters (4 bytes) but with memory disadvantage
如前所述,差异主要在于底层变量的大小,在每种情况下,它们都会变大以允许表示更多字符。
然而,字体、编码和其他东西都非常复杂(没有必要?),所以需要一个大链接来填充更多细节:
http://www.cs.tut.fi/~jkorpela/chars.html#ascii
不要期望理解所有的东西,但是如果你不想在以后遇到问题,尽可能早地学习(或者让别人帮你整理)是值得的。
保罗。
简而言之:
UTF-8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes. Good for English text, not so good for Asian text. UTF-16: Variable-width encoding. Code points U+0000 to U+FFFF take 2 bytes, code points U+10000 to U+10FFFF take 4 bytes. Bad for English text, good for Asian text. UTF-32: Fixed-width encoding. All code points take four bytes. An enormous memory hog, but fast to operate on. Rarely used.
长:参见维基百科:UTF-8, UTF-16和UTF-32。