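I'm trying to figure out which collation I should use for various types of data. 100% of the content I will be storing is user-submitted.

My understanding is that I should use UTF-8 General CI (case-insensitive) instead of UTF-8 Binary. However, I can't find a clear distinction between UTF-8 General CI and UTF-8 Unicode CI.

Should I store user-submitted content in UTF-8 General CI or UTF-8 Unicode CI columns? What type of data is UTF-8 Binary suited to?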


Current answer

utf8_bin compares the bits blindly. No case folding, no accent stripping.

utf8_general_ci compares one code point with one code point. It does case folding and accent stripping, but no 2-character comparisons; for example, ĳ (U+0133) is not equal to ij in this collation.

utf8_*_ci is a set of language-specific rules, but otherwise like unicode_ci. Some special cases: Ç, Č, ch, ll.

utf8_unicode_ci follows an old Unicode standard for comparisons. ĳ = ij, but ae != æ.

utf8_unicode_520_ci follows a newer Unicode standard. ae = æ.
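To make the contrast between byte-wise and folded comparison concrete, here is a minimal Python sketch. It is only an approximation for illustration: the function names are made up, and the folding uses Python's `unicodedata` module, not MySQL's actual collation weight tables.

```python
import unicodedata


def binary_equal(a: str, b: str) -> bool:
    """Byte-for-byte comparison, in the spirit of a _bin collation."""
    return a.encode("utf-8") == b.encode("utf-8")


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def folded_equal(a: str, b: str) -> bool:
    """Case- and accent-insensitive comparison (illustrative, not MySQL's rules)."""
    return strip_accents(a).lower() == strip_accents(b).lower()


if __name__ == "__main__":
    # "\u00e9" is é, "\u0133" is the single-character ligature ĳ.
    for left, right in [("a", "A"), ("e", "\u00e9"), ("\u0133", "ij")]:
        print(repr(left), repr(right),
              "binary:", binary_equal(left, right),
              "folded:", folded_equal(left, right))
```

In this sketch the ligature ĳ stays distinct from the two letters ij, because canonical decomposition never turns one character into two letters. That mirrors the "no 2-character comparisons" limitation described for utf8_general_ci.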

For details on what equals what in the various utf8 collations, see the collation chart.

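utf8, as defined by MySQL, is limited to 1- to 3-byte UTF-8 codes. This leaves out Emoji and some Chinese characters. So if you want to go much beyond Europe, you should really switch to utf8mb4.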
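The 3-byte limit can be checked on the application side before data reaches the database. The sketch below is an illustration only; the helper names are hypothetical. It relies on the fact that UTF-8 encodes code points up to U+FFFF in at most three bytes and everything above that in four.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_three_byte_utf8(text: str) -> bool:
    """True if every character is at or below U+FFFF, i.e. needs at most 3 bytes."""
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    samples = ["e", "\u00e9", "\u4e2d", "\U00020000", "\U0001F601"]
    for ch in samples:
        print(f"U+{ord(ch):04X}", utf8_length(ch), "bytes",
              "fits" if fits_three_byte_utf8(ch) else "needs utf8mb4")
```

Here U+4E2D (中) fits in three bytes, while U+20000 (a less common Chinese character) and U+1F601 (an emoji) each need four.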

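With the appropriate spelling changes, the points above also apply to utf8mb4. Going forward, utf8mb4 with utf8mb4_unicode_520_ci is preferred, or (in 8.0) utf8mb4_0900_ai_ci.

utf16 and utf32 are variants on utf8; there is virtually no use for them. ucs2 is closer to "Unicode" than to "utf8"; there is virtually no use for it either.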

Other answers

The accepted answer is outdated.

If you use MySQL 5.5.3+, use utf8mb4_unicode_ci instead of utf8_unicode_ci to make sure the characters your users type won't cause errors.

utf8mb4 supports emojis, for example, whereas utf8 might give you hundreds of encoding-related bugs like:

Incorrect string value: '\xF0\x9F\x98\x81…' for column 'data' at row 1
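When debugging an error like this, it can help to decode the reported bytes and see which character triggered it. The following is a small sketch with a hypothetical helper name; it decodes only the four bytes that are visible in the message above.

```python
from typing import List


def describe_bytes(raw: bytes) -> List[str]:
    """Decode UTF-8 bytes and report each code point with its encoded length."""
    text = raw.decode("utf-8")
    return [f"U+{ord(ch):04X} ({len(ch.encode('utf-8'))} bytes)" for ch in text]


if __name__ == "__main__":
    # The leading bytes shown in the error message.
    print(describe_bytes(b"\xF0\x9F\x98\x81"))
```

The result is a single code point above U+FFFF that takes four bytes, which is exactly the kind of character a 3-byte utf8 column cannot store.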

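You should also be aware that with utf8_general_ci, when a varchar field is used as a unique or primary index, inserting two values like 'a' and 'á' will give a duplicate key error.

In general, utf8_general_ci is faster than utf8_unicode_ci, but less correct.

Here is the difference: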

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
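The idea of an expansion, where one character compares equal to a sequence of characters, can be sketched in Python. This is an analogy only: `str.casefold()` applies Unicode full case folding, which happens to expand "ß" to "ss". It is not MySQL's collation algorithm, and the function names are made up for the example.

```python
def one_to_one_equal(a: str, b: str) -> bool:
    """Compare position by position; a character can only match one character."""
    return len(a) == len(b) and all(x.lower() == y.lower() for x, y in zip(a, b))


def expansion_equal(a: str, b: str) -> bool:
    """Compare after full case folding, which may expand one character into several."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    pairs = [("Stra\u00dfe", "STRASSE"), ("abc", "ABC")]
    for left, right in pairs:
        print(left, right,
              "one-to-one:", one_to_one_equal(left, right),
              "with expansion:", expansion_equal(left, right))
```

A strictly one-to-one comparison can never match "Straße" with "STRASSE" because the strings differ in length, which is the kind of limitation the quoted passage attributes to utf8_general_ci.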

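Quoted from: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

For a more detailed explanation, read the following post from the MySQL forums: http://forums.mysql.com/read.php?103,187048,188748

As for utf8_bin: both utf8_general_ci and utf8_unicode_ci perform case-insensitive comparisons. In contrast, utf8_bin is case-sensitive (among other differences), because it compares the binary values of the characters.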

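Indeed, I tested saving values like 'é' and 'e' in a column with a unique index, and they cause a duplicate error with both 'utf8_unicode_ci' and 'utf8_general_ci'. You can save both only in a column with the 'utf8_bin' collation.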
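Before adding a unique index on a case- and accent-insensitive column, it may be worth checking existing values for pairs that would collide. The sketch below is a rough approximation under stated assumptions: the key function strips accents and case with Python's `unicodedata`, which is not identical to any MySQL collation, and all names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def accent_case_key(value: str) -> str:
    """Approximate comparison key: remove combining marks, then lowercase."""
    decomposed = unicodedata.normalize("NFD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def find_collisions(values, key=accent_case_key):
    """Group values by key and return only the groups with more than one member."""
    groups = defaultdict(list)
    for value in values:
        groups[key(value)].append(value)
    return {k: vs for k, vs in groups.items() if len(vs) > 1}


if __name__ == "__main__":
    data = ["e", "\u00e9", "E", "x"]
    print(find_collisions(data))                                   # insensitive key
    print(find_collisions(data, key=lambda v: v.encode("utf-8")))  # byte-wise key
```

With the insensitive key, 'e', 'é' and 'E' land in the same group; with the byte-wise key, every value is distinct, matching the observation that only a binary collation accepts all of them under a unique index.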

The MySQL documentation (at http://dev.mysql.com/doc/refman/5.7/en/charset-applications.html) sets the 'utf8_general_ci' collation in its examples:

[mysqld]
character-set-server=utf8
collation-server=utf8_general_ci
