What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings?
In what way are these helpful for programmers?
Current answer
Going down your list:
"Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Values outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs. These used to be relatively rarely used, but now many consumer applications will need to be aware of non-BMP characters in order to support emojis.
UTF-8: Variable length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.
UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)
UTF-32: Fixed width encoding using 4 bytes per code point. This isn't very efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
ASCII: Single byte encoding only using the bottom 7 bits. (Unicode code points 0-127.) No accents etc.
ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default locale/codepage for my system" which is obtained via Encoding.Default, and is often Windows-1252 but can be other locales.
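There are more tips on debugging Unicode problems on my Unicode page.
The other big resource is of course unicode.org, which contains more information than you'll ever be able to work your way through. Possibly the most useful part is the code charts.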
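To make the size differences in the list above concrete, here is a minimal sketch. It is written in Python rather than .NET purely for brevity; that choice, the sample characters, and the helper names `byte_lengths` and `is_ascii_encodable` are illustrative assumptions, not part of the original answer. It uses only the standard `str.encode` and `bytes.decode` methods with Python's built-in codec names.

```python
def byte_lengths(text):
    """Return how many bytes `text` occupies in each UTF encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" codec names are used so that no byte order mark is added.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def is_ascii_encodable(text):
    """Return True if every character fits in the 7-bit ASCII range."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # U+0041 (ASCII), U+00E9 and U+20AC (BMP), U+1F600 (outside the BMP).
    for text in ["A", "é", "€", "😀"]:
        print(f"U+{ord(text):04X}", byte_lengths(text), is_ascii_encodable(text))

    # A non-BMP code point becomes a surrogate pair: two 16-bit code units.
    print("😀".encode("utf-16-be").hex())

    # The same "ANSI" byte means different characters in different code pages.
    raw = b"\xe9"
    for codepage in ("cp1252", "cp1251"):
        ch = raw.decode(codepage)
        print(codepage, f"U+{ord(ch):04X}")
```

Run as a script, this shows "A" taking 1, 2 and 4 bytes in UTF-8, UTF-16 and UTF-32, while the emoji takes 4 bytes in all three; its UTF-16 form is the surrogate pair D83D DE00. The byte 0xE9 decodes to U+00E9 (é) under Windows-1252 but to U+0439 (й) under Windows-1251, which is why "ANSI" text is ambiguous unless you know the code page.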
Other answers
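Some reading on character encodings: Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
By the way, ASP.NET has nothing to do with it. Encodings are universal.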