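Trying to understand the subtleties of modern Unicode is making my head hurt. In particular, the distinction between code points, characters, glyphs and graphemes (concepts which, in the simplest case of English text written in ASCII characters, all have a one-to-one relationship with each other) is causing me trouble.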

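Seeing how these terms are used in documents like Mathias Bynens' "JavaScript has a Unicode problem" or Wikipedia's article on Han unification, I've gathered that these concepts are not the same thing and that it's dangerous to conflate them, but I'm struggling to grasp what each term means.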

The Unicode Consortium offers a glossary to explain this stuff, but it's full of "definitions" like this:

Abstract Character. A unit of information used for the organization, control, or representation of textual data. ... ... Character. ... (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. ... ... Glyph. (1) An abstract form that represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character. ... Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system. ...

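Most of these definitions sound very academic and formal, yet they either lack any meaning or defer the problem of definition to yet another glossary entry or section of the standard.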

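So I seek the arcane wisdom of those more learned than I: how exactly do these concepts differ from each other, and in what circumstances would they not have a one-to-one relationship with each other?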


Character is an overloaded term that can mean many things.

A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.

A code unit is the unit of storage of a part of an encoded code point. In UTF-8 this means 8 bits, in UTF-16 this means 16 bits. A single code unit may represent a full code point, or part of a code point. For example, the snowman glyph (☃) is a single code point but 3 UTF-8 code units, and 1 UTF-16 code unit.

A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element of the writing system. For example, both a and ä are graphemes, but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a followed by one for the diaeresis; but there's also an alternative, legacy, single code point representing this grapheme). Some code points are never part of any grapheme (e.g. the zero-width non-joiner, or directional overrides).

A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof. Fonts may compose multiple glyphs into a single representation, for example, if the above ä is a single code point, a font may choose to render that as two separate, spatially overlaid glyphs. For OTF, the font's GSUB and GPOS tables contain substitution and positioning information to make this work. A font may contain multiple alternative glyphs for the same grapheme, too.
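To make the code point versus code unit distinction concrete, here is a small Python sketch. It is only an illustration: the variable names are mine, and it relies on the fact that a Python 3 `str` is a sequence of code points, so `len()` counts code points rather than code units or graphemes.

```python
# A Python 3 str is a sequence of code points, so len() counts code points.
snowman = "\u2603"  # SNOWMAN

# Code points, written in the usual U+XXXX notation.
code_points = [f"U+{ord(c):04X}" for c in snowman]

# Code units depend on the encoding form: 8-bit units for UTF-8,
# 16-bit units for UTF-16 (two bytes each, hence the division).
utf8_units = len(snowman.encode("utf-8"))
utf16_units = len(snowman.encode("utf-16-le")) // 2

print(code_points, utf8_units, utf16_units)

# The same grapheme can be one code point or two.
precomposed = "\u00e4"  # single code point for a with diaeresis
decomposed = "a\u0308"  # base letter a followed by COMBINING DIAERESIS

print(len(precomposed), len(decomposed))
```

Both `precomposed` and `decomposed` normally render identically, which is exactly why counting code points is not the same as counting what a reader sees.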


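Outside the Unicode standard, a character is an individual unit of text composed of one or more graphemes. What the Unicode standard defines as "characters" is actually a mix of graphemes and characters. Unicode provides rules for interpreting juxtaposed graphemes as individual characters.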

A Unicode code point is a unique number assigned to each Unicode character (which is either a character or a grapheme).

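Unfortunately, the Unicode rules allow some juxtaposed graphemes to be interpreted as other graphemes that already have their own code points (precomposed forms). This means that there is more than one way in Unicode to represent a character. Unicode normalization addresses this issue.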
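As a rough illustration of what normalization buys you, the following Python sketch compares the precomposed and decomposed spellings of ä using the standard library's `unicodedata.normalize`. The variable names are my own; NFC and NFD are the composed and decomposed normalization forms.

```python
import unicodedata

precomposed = "\u00e4"  # one code point
decomposed = "a\u0308"  # two code points: base letter + combining mark

# Compared code point by code point, the two spellings differ.
raw_equal = precomposed == decomposed

# After normalizing both sides to the same form, they compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

# NFD goes the other way and splits the precomposed form apart.
nfd = unicodedata.normalize("NFD", precomposed)
nfd_names = [unicodedata.name(c) for c in nfd]

print(raw_equal, nfc_equal, nfd_names)
```

In practice this means text should usually be normalized to one form before it is compared, hashed, or used as a key.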

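A glyph is the visual representation of a character. A font provides a set of glyphs for a certain set of characters (not Unicode characters). For every character, there is an infinite number of possible glyphs.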

Reply to Mark Amery

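First, as I stated, there is an infinite number of possible glyphs for each character, so no, a character is not "always represented by a single glyph". Unicode doesn't concern itself much with glyphs, and the things it defines in its code charts are certainly not glyphs. The problem is that they are not all characters either. So what are they?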

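Which is the greater entity, the grapheme or the character? What does one call those graphic elements in text that are not letters or punctuation? One term that quickly springs to mind is "grapheme". It's a word that precisely conjures up the idea of "a graphical unit in a text". I offer this definition: a grapheme is the smallest distinct component in a written text.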

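One could go the other way and say that graphemes are composed of characters, but then Chinese characters would have to be called "Chinese graphemes", and all the bits and pieces those Chinese graphemes are composed of would have to be called "characters" instead. However, that's all backwards. Graphemes are the distinct little bits and pieces. Characters are more developed. The phrase "glyphs are composable" would be better stated, in the context of Unicode, as "characters are composable".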

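Unicode defines characters, but it also defines graphemes that are meant to be composed with other graphemes or characters. Those monstrosities you composed are a fine example of this. If they catch on, maybe they'll get their own code points in a later version of Unicode ;)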

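There's a recursive element to all this. At higher levels, graphemes become characters become graphemes, but it's graphemes all the way down.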

Reply to T S

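Chapter 1 of the standard states: "The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility". Given this statement, we should be prepared for some conflation of terms in the standard. Sometimes the proper terminology only becomes clear in retrospect, as a standard develops.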

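It often happens in formal definitions of a language that two fundamental things are defined in terms of each other. For example, in XML an element is defined as a start tag, possibly followed by content, followed by an end tag. Content is defined in turn as either an element, character data, or a few other possible things. A pattern of self-referential definitions is also implicit in the Unicode standard: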

A grapheme is a code point or a character. A character is composed from a sequence of one or more graphemes.

When first confronted with these two definitions, the reader might object to the first one on the grounds that a code point is a character, but that's not always true. A sequence of two code points sometimes encodes a single code point under normalization, and that encoded code point represents the character, as illustrated in figure 2.7. So we have sequences of code points that encode other code points. This is getting a little tricky, and we haven't even reached the layer where character encoding schemes such as UTF-8 are used to encode code points into byte sequences.

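In some contexts, such as an article about diacritics, an individual part of a character might show up in the text all by itself. In that context the individual part could be considered a character, so it makes sense that the Unicode standard remains flexible here as well.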

As Mark Amery pointed out, a character can be composed into a more complex thing. That is, each character can serve as a grapheme if desired. The final result of all composition is a thing that "the user thinks of as a character". There doesn't seem to be any real resistance, either in the standard or in this discussion, to the idea that at the highest level there are these things in the text that the user thinks of as individual characters. To avoid overloading that term, we can use "grapheme" in all cases where we want to refer to parts used to compose a character.

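At times, the Unicode standard is all over the place with its terminology. For example, Chapter 3 defines UTF-8 as an "encoding form", whereas the glossary calls UTF-8 a "character encoding scheme". Another example is "Grapheme_Base" and "Grapheme_Extend", which are acknowledged to be mistakes but persist because purging them would be a daunting task. There is still work to be done to tighten up the terminology used by the standard.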

The Proposal for addition of COMBINING GRAPHEME JOINER got it wrong when it stated that "Graphemes are sequences of one or more encoded characters that correspond to what users think of as characters." It should instead read, "A sequence of one or more graphemes composes what the user thinks of as a character." Then it could use the term "grapheme sequence" distinctly from the term "character sequence". Both terms are useful. "Grapheme sequence" neatly implies the process of building up a character from smaller pieces. "Character sequence" means what we all typically intuit it to mean: "A sequence of things the user thinks of as characters."

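Sometimes a programmer really does want to operate at the level of grapheme sequences, so mechanisms for inspecting and manipulating those sequences should be available. Generally, though, when processing text it is sufficient to operate on "character sequences" (what the user thinks of as characters) and let the system manage the lower-level details.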
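A minimal Python sketch of the two levels, under loud assumptions: the helper `approximate_clusters` is hypothetical and deliberately crude. It only attaches combining marks (code points with a non-zero canonical combining class) to the preceding code point. It is not the segmentation algorithm the Unicode standard specifies, and it ignores emoji joiner sequences, Hangul jamo, and many other cases.

```python
import unicodedata

def approximate_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper for illustration only; a rough approximation of
    user-perceived characters, not real Unicode text segmentation.
    """
    clusters = []
    for ch in text:
        # A non-zero combining class marks a combining character, which
        # belongs with whatever came before it.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "nai\u0308ve"  # "naive" with the diaeresis as a separate code point

pieces = list(word)                 # the low-level pieces: 6 code points
units = approximate_clusters(word)  # what a reader would count: 5 units

print(len(pieces), len(units))
```

Iterating over `pieces` gives the lower-level view, while iterating over `units` is closer to what the user thinks of as characters. For real applications a library that implements the standard's segmentation rules is the safer choice.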

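In every case covered so far in this discussion, it's cleaner to use "grapheme" to refer to the indivisible components and "character" to refer to the composed entity. This usage also better reflects the long-established meanings of both terms.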