Are lexers and parsers really that different in theory?

It seems fashionable to hate regular expressions: Coding Horror, another blog post.

Yet the popular lexing-based tools (pygments, geshi, prettify) all use regular expressions. They seem to be able to lex anything...

When is lexing enough, and when do you need EBNF?

Has anyone used the tokens produced by these lexers with the bison or antlr parser generators?


Current answer

Answering the question as asked (without repeating unduly what appears in the other answers).

Lexers and parsers are not very different, as suggested by the accepted answer. Both are based on simple language formalisms: regular languages for lexers and, almost always, context-free (CF) languages for parsers. They both are associated with fairly simple computational models, the finite state automaton and the push-down stack automaton. Regular languages are a special case of context-free languages, so that lexers could be produced with the somewhat more complex CF technology. But it is not a good idea for at least two reasons.

A basic point of programming is that system components should be built with the most appropriate technology, so that they are easy to produce, to understand and to maintain. The technology should not be overkill (using techniques much more complex and costly than needed), nor should it be at the limit of its power, thus requiring technical contortions to achieve the desired goal.

This is why "it seems fashionable to hate regular expressions". Though they can do a lot, they sometimes require very unreadable coding to achieve it, not to mention that the various extensions and restrictions in implementations somewhat reduce their theoretical simplicity. Lexers do not usually do that, and they are usually a simple, efficient, and appropriate technology for recognizing tokens. Using CF parsers for tokens would be overkill, even though it is possible.
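To make the contrast concrete, here is a minimal sketch (in Python) of what such a regex-driven lexer amounts to; the token names and patterns are invented for illustration, and nothing in it needs more than regular-language machinery:

import re
# A hypothetical token specification: each token class is an ordinary regular
# expression, tried at the current position via one combined pattern.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),          # whitespace is recognized, then discarded
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
def tokenize(text):
    """Yield (kind, lexeme) pairs; a plain finite-state scan, no stack needed."""
    pos = 0
    while pos < len(text):
        m = MASTER_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())
        pos = m.end()
print(list(tokenize("x1 = 42 + y")))
# [('ID', 'x1'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('ID', 'y')]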

Another reason not to use the CF formalism for lexers is that it might then be tempting to use the full CF power. And that could raise structural problems regarding how programs are read.

Fundamentally, most of the structure of program text, from which meaning is extracted, is a tree structure. It expresses how the parsed sentence (program) is generated from syntax rules. Semantics is derived by compositional techniques (homomorphism for the mathematically oriented) from the way syntax rules are composed to build the parse tree. Hence the tree structure is essential. The fact that tokens are identified with a lexer based on regular sets does not change the situation, because CF composed with regular still gives CF (I am speaking very loosely about regular transducers, which transform a stream of characters into a stream of tokens).

However, composing CF with CF (via CF transducers ... sorry for the math) does not necessarily give CF, and it might make things more general but less tractable in practice. So CF is not the appropriate tool for lexers, even though it could be used.

One of the major differences between regular and CF is that regular languages (and transducers) compose very well with almost any formalism in various ways, while CF languages (and transducers) do not, not even with themselves (with a few exceptions).

(Note that regular transducers may have other uses, such as the formalization of some syntax-error handling techniques.)

BNF is just a specific syntax for presenting CF grammars.

EBNF is syntactic sugar for BNF, using the facilities of regular notation to give a terser version of BNF grammars. It can always be transformed into an equivalent pure BNF.

However, the regular notation in EBNF is often used only to emphasize those parts of the syntax that correspond to the structure of lexical elements and should be recognized by the lexer, while the rest is rather presented in straight BNF. But this is not an absolute rule.

To summarize: the simpler structure of tokens is better analyzed with the simpler technology of regular languages, while the tree-oriented structure of the language (of program syntax) is better handled by CF grammars.

I would suggest also looking at AHR's answer.

But this leaves an open question: why trees?

Trees are a good basis for specifying syntax because:

they give a simple structure to the text, and they make it very convenient to associate semantics with the text on the basis of that structure, using mathematically well-understood techniques (compositionality via homomorphisms), as indicated above. This is a fundamental algebraic tool for defining the semantics of mathematical formalisms.

Hence it is a good intermediate representation, as shown by the success of Abstract Syntax Trees (AST). Note that ASTs are often different from parse trees because the parsing technology used by many professionals (such as LL or LR) applies only to a subset of CF grammars, thus forcing grammatical distortions which are corrected later in the AST. This can be avoided with more general parsing technology (based on dynamic programming) that accepts any CF grammar.
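As a small illustration of the compositional point above (the node shapes and the evaluator are invented for this answer, not taken from any particular compiler), the meaning of an AST node can be computed purely from the meanings of its children:

# A toy abstract syntax tree: tuples like ("num", 3) or ("add", left, right).
# The meaning of a node is defined only in terms of the meanings of its subtrees,
# which is the homomorphism/compositionality point made above.
def evaluate(node):
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "add":
        return evaluate(node[1]) + evaluate(node[2])
    if kind == "mul":
        return evaluate(node[1]) * evaluate(node[2])
    raise ValueError(f"unknown node kind: {kind}")
tree = ("mul", ("add", ("num", 1), ("num", 2)), ("num", 3))   # (1 + 2) * 3
print(evaluate(tree))  # 9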

The statement that programming languages are context-sensitive (CS) rather than CF is arbitrary and disputable.

The problem is that the separation of syntax and semantics is arbitrary. Checking declarations or type agreement may be seen either as part of syntax or as part of semantics. The same would be true of gender and number agreement in natural languages. But there are natural languages in which plural agreement depends on the actual semantic meaning of words, so that it does not fit well with syntax.

Many definitions of programming languages in denotational semantics place declarations and type checking in the semantics. So the statement made by Ira Baxter that CF parsers are being hacked to get the context sensitivity required by the syntax is at best an arbitrary view of the situation. It may be organized as a hack in some compilers, but it does not have to be.

Also, it is not just that CS parsers (in the sense used in other answers here) are hard to build and less efficient. They are also inadequate for expressing perspicuously the kind of context sensitivity that may be needed. And they do not naturally produce a syntactic structure (such as a parse tree) that is convenient for deriving the semantics of the program, i.e. for generating the compiled code.

Other answers

There are several reasons why the analysis portion of a compiler is normally separated into a lexical-analysis phase and a parsing (syntax analysis) phase:

1. Simplicity of design is the most important consideration. Separating lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and whitespace as syntactic units would be considerably more complex than one that can assume comments and whitespace have already been removed by the lexical analyzer. If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

Source: Compilers: Principles, Techniques, and Tools (2nd edition), by Alfred V. Aho (Columbia University), Monica S. Lam (Stanford University), Ravi Sethi (Avaya), and Jeffrey D. Ullman (Stanford University).

What parsers and lexers have in common:

1. They read symbols of some alphabet from their input. Hint: the alphabet doesn't necessarily have to consist of letters, but it has to consist of symbols which are atomic for the language understood by the parser/lexer. Symbols for the lexer: ASCII characters. Symbols for the parser: the particular tokens, which are terminal symbols of its grammar.
2. They analyse these symbols and try to match them with the grammar of the language they understand. Here is where the real difference usually lies; see below for more. Grammar understood by lexers: regular grammar (Chomsky level 3). Grammar understood by parsers: context-free grammar (Chomsky level 2).
3. They attach semantics (meaning) to the language pieces they find. Lexers attach meaning by classifying lexemes (strings of symbols from the input) as particular tokens: e.g. all these lexemes: *, ==, <=, ^ will be classified as an "operator" token by the C/C++ lexer. Parsers attach meaning by classifying strings of tokens from the input (sentences) as particular nonterminals and by building the parse tree: e.g. all these token strings: [number][operator][number], [id][operator][id], [id][operator][number][operator][number] will be classified as the "expression" nonterminal by the C/C++ parser (see the sketch after this list).
4. They can attach some additional meaning (data) to the recognized elements. When a lexer recognizes a character sequence constituting a proper number, it can convert it to its binary value and store it with the "number" token. Similarly, when a parser recognizes an expression, it can compute its value and store it with the "expression" node of the syntax tree.
5. They all produce on their output proper sentences of the language they recognize. Lexers produce tokens, which are sentences of the regular language they recognize. Each token can have an inner syntax (though level 3, not level 2), but that doesn't matter for the output data or for the component that reads them. Parsers produce syntax trees, which are representations of sentences of the context-free language they recognize. Usually it's only one big tree for the whole document/source file, because the whole document/source file is a proper sentence for them. But there is no reason why a parser couldn't produce a series of syntax trees on its output. E.g. it could be a parser which recognizes SGML tags stuck into plain text. So it would tokenize the SGML document into a series of tokens: [TXT][TAG][TAG][TXT][TAG][TXT]...
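Here is a sketch of the parser half of that table (the token kinds and the tiny grammar are invented for illustration): it consumes a token stream of the sort a lexer would emit and groups token strings into an "expression" node of a parse tree:

# Minimal sketch (invented token kinds): classify token strings such as
# [number][operator][id] as one "expression" node of a parse tree.
def parse_expression(tokens):
    """expression --> operand (OP operand)*   (a deliberately tiny grammar)"""
    pos = 0
    def operand():
        nonlocal pos
        kind, text = tokens[pos]
        if kind not in ("NUMBER", "ID"):
            raise SyntaxError(f"expected operand, got {kind}")
        pos += 1
        return (kind.lower(), text)
    node = operand()
    while pos < len(tokens) and tokens[pos][0] == "OP":
        op = tokens[pos][1]
        pos += 1
        node = ("expression", op, node, operand())
    return node
print(parse_expression([("NUMBER", "2"), ("OP", "+"), ("ID", "x")]))
# ('expression', '+', ('number', '2'), ('id', 'x'))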

As you can see, parsers and tokenizers have much in common. One parser can be a tokenizer for another parser, which reads its input tokens as symbols from its own alphabet (tokens are simply symbols of some alphabet) in the same way that sentences of one language can be alphabetic symbols of some other, higher-level language. For example, if . and - are the symbols of the alphabet M ("Morse code symbols"), then you can build a parser which recognizes strings of these dots and dashes as letters encoded in Morse code. The sentences in the language "Morse Code" could be tokens for some other parser, for which these tokens are atomic symbols of its language (e.g. the "English Words" language). And these "English Words" could be tokens (symbols of the alphabet) for some higher-level parser which understands the "English Sentences" language. All these languages differ only in the complexity of the grammar. Nothing more.

So what is the deal with these "levels of Chomsky's grammars"? Noam Chomsky classified grammars into four levels depending on their complexity:

Level 3: Regular grammars. They use regular expressions, that is, they can consist only of symbols of the alphabet (a, b), their concatenations (ab, aba, bbb etc.), or alternatives (e.g. a|b). They can be implemented as finite state automata (FSA), like an NFA (Nondeterministic Finite Automaton) or, better, a DFA (Deterministic Finite Automaton). Regular grammars cannot handle nested syntax, e.g. properly nested/matched parentheses (()()(()())), nested HTML/BBcode tags, nested blocks, etc. That is because a finite automaton dealing with them would have to have infinitely many states to handle infinitely many nesting levels (see the small demonstration after this list).
Level 2: Context-free grammars. They can have nested, recursive, self-similar branches in their syntax trees, so they can handle nested structures well. They can be implemented as a state automaton with a stack, which is used to represent the nesting level of the syntax. In practice they are usually implemented as a top-down, recursive-descent parser which uses the machine's procedure call stack to track the nesting level and uses recursively called procedures/functions for every non-terminal symbol of the grammar. But they cannot handle context-sensitive syntax, e.g. when you have an expression x+3 and in one context this x could be the name of a variable, while in another context it could be the name of a function, etc.
Level 1: Context-sensitive grammars.
Level 0: Unrestricted grammars, also called recursively enumerable grammars.
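As a toy illustration of the boundary between level 3 and level 2 described above (the check functions are hand-rolled just for this answer): matching nested parentheses requires something that can count arbitrarily deep, which is exactly what a stack (here reduced to a depth counter) provides and a fixed set of regular-expression states cannot:

def balanced(text):
    """Recognize properly nested parentheses: a context-free job. The single
    counter plays the role of the pushdown stack (only one kind of symbol is
    ever pushed, so a depth counter is enough)."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:              # a closing parenthesis with nothing open
                return False
    return depth == 0
print(balanced("(()()(()()))"))  # True
print(balanced("(()"))           # False
# A regular expression can only fake this up to a fixed nesting depth, e.g. 2:
import re
DEPTH2 = re.compile(r"^(\((\(\))*\))*$")
print(bool(DEPTH2.match("(()())")))   # True: depth 2
print(bool(DEPTH2.match("((()))")))   # False: depth 3 is beyond this pattern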

Yes, they are very different in theory, and in implementation.

Lexers are used to recognize the "words" that make up language elements, because the structure of such words is generally simple. Regular expressions are extremely good at handling this simpler structure, and there are very high-performance regular-expression matching engines used to implement lexers.

Parsers are used to recognize the "structure" of language phrases. Such structure is generally far beyond what "regular expressions" can recognize, so one needs "context-sensitive" parsers to extract such structure. Context-sensitive parsers are hard to build, so the engineering compromise is to use "context-free" grammars and to add hacks to the parsers ("symbol tables", etc.) to handle the context-sensitive part.
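A classic concrete instance of that compromise is the "lexer hack" used for C's typedef names; the sketch below is schematic and not taken from any particular compiler: the grammar stays context-free, and the lexer consults a symbol table that the parser fills in:

# Sketch of the classic "lexer hack" for C: "T * x;" is a pointer declaration if T
# is a typedef name, and a multiplication expression otherwise. The grammar stays
# context-free; the lexer consults a symbol table that the parser updates whenever
# it processes a typedef declaration. Names here are invented for illustration.
typedef_names = {"size_t"}        # filled in by the parser as typedefs are processed
def classify(identifier):
    """Return the token kind the lexer should emit for this identifier."""
    return "TYPE_NAME" if identifier in typedef_names else "IDENTIFIER"
print(classify("size_t"))   # TYPE_NAME  -> "size_t * p;" parses as a declaration
print(classify("foo"))      # IDENTIFIER -> "foo * p;" parses as a multiplication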

Neither lexing technology nor parsing technology is likely to go away soon.

They may be unified by deciding to use "parsing" technology to recognize "words", as is currently explored by so-called scannerless GLR parsers. That has a runtime cost, as you are applying more general machinery to what is often a problem that doesn't need it, and usually you pay for that in overhead. Where you have lots of free cycles, that overhead may not matter. If you process a lot of text, then the overhead does matter and classical regular expression parsers will continue to be used.

When is lexing enough, and when do you need EBNF?

EBNF really doesn't add much to the power of grammars. It is just a convenience / shortcut notation / "syntactic sugar" over the standard Chomsky Normal Form (CNF) grammar rules. For example, the EBNF alternative:

S --> A | B

can be achieved in CNF by just listing each alternative production separately:

S --> A      // `S` can be `A`,
S --> B      // or it can be `B`.

An optional element from EBNF:

S --> X?

can be achieved in CNF by using a nullable production, that is, one which can be replaced by an empty string (denoted here by just an empty production; others use epsilon, lambda, or a crossed circle):

S --> B       // `S` can be `B`,
B --> X       // and `B` can be just `X`,
B -->         // or it can be empty.

A production of a form like the last B above is called an "erasure", because it can erase whatever it stands for in other productions (producing an empty string instead of something else).

Zero-or-more repetition from EBNF:

S --> A*

can be obtained by using a recursive production, that is, one which embeds itself somewhere in it. This can be done in two ways. The first is left recursion (which should usually be avoided, because top-down recursive-descent parsers cannot parse it):

S --> S A    // `S` is just itself ended with `A` (which can be done many times),
S -->        // or it can begin with empty-string, which stops the recursion.

Knowing that it generates just an empty string (ultimately) followed by zero or more As, the same language (though not the same grammar, and not the same parse trees!) can be expressed using right recursion:

S --> A S    // `S` can be `A` followed by itself (which can be done many times),
S -->        // or it can be just empty-string end, which stops the recursion.

And for one-or-more repetition from EBNF:

S --> A+

it can be done by factoring out one A and using * as before:

S --> A A*

which you can express in CNF as follows (I use right recursion here; try to figure out the other one yourself, as an exercise):

S --> A S   // `S` can be one `A` followed by `S` (which stands for more `A`s),
S --> A     // or it could be just one single `A`.
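If it helps, the same correspondence is visible in a hand-written recursive-descent parser (a small sketch with invented names): the EBNF repetition A+ becomes a loop, while the right-recursive BNF form becomes a recursive call, and both accept the same strings:

class Tokens:
    """A trivial token stream with one-symbol lookahead (invented for this sketch)."""
    def __init__(self, items):
        self.items, self.pos = list(items), 0
    def peek(self):
        return self.items[self.pos] if self.pos < len(self.items) else None
    def take(self, kind):
        assert self.peek() == kind, f"expected {kind}, got {self.peek()}"
        self.pos += 1
        return kind
def s_loop(ts):             # EBNF:  S --> A+   (repetition as a loop)
    out = [ts.take("A")]
    while ts.peek() == "A":
        out.append(ts.take("A"))
    return out
def s_recursive(ts):        # BNF:   S --> A S | A   (repetition as right recursion)
    first = ts.take("A")
    if ts.peek() == "A":
        return [first] + s_recursive(ts)
    return [first]
print(s_loop(Tokens(["A", "A", "A"])))       # ['A', 'A', 'A']
print(s_recursive(Tokens(["A", "A", "A"])))  # ['A', 'A', 'A']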

Knowing this, you can now probably recognize a grammar for a regular expression (that is, a regular grammar) as one which can be expressed in a single EBNF production consisting only of terminal symbols. More generally, you can recognize regular grammars when you see productions similar to these:

A -->        // Empty (nullable) production (AKA erasure).
B --> x      // Single terminal symbol.
C --> y D    // Simple state change from `C` to `D` when seeing input `y`.
E --> F z    // Simple state change from `E` to `F` when seeing input `z`.
G --> G u    // Left recursion.
H --> v H    // Right recursion.

That is, using only empty strings, terminal symbols, and simple non-terminals for substitutions and state changes, and using recursion only to achieve repetition (iteration, which is just linear recursion, the kind that doesn't branch like a tree). Nothing more advanced than these, and you can be sure it is regular syntax and you can go with just a lexer for it.

But when your syntax uses recursion in a non-trivial way, to produce tree-like, self-similar, nested structures, such as the following one:

S --> a S b    // `S` can be itself "parenthesized" by `a` and `b` on both sides.
S -->          // or it could be (ultimately) empty, which ends recursion.

then you can easily see that this cannot be done with a regular expression, because you cannot resolve it into one single EBNF production in any way; you'll end up substituting for S indefinitely, which will always add another a and b on both sides. Lexers (more specifically: the finite state automata used by lexers) cannot count to an arbitrary number (they are finite, remember?), so they don't know how many a's there were in order to match them evenly with so many b's. Grammars like this are called context-free grammars (at the very least), and they require a parser.
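For a feel of the extra machinery that is needed, here is a minimal hand-rolled recursive-descent recognizer for that very grammar (written just for this answer): the recursion depth of the function plays the role of the pushdown stack that a plain lexer lacks:

def parse_S(text, pos=0):
    """Recognize S --> a S b | <empty>; return the position just after the match."""
    if pos < len(text) and text[pos] == "a":
        pos = parse_S(text, pos + 1)       # recursion depth counts the a's (the "stack")
        if pos < len(text) and text[pos] == "b":
            return pos + 1
        raise SyntaxError(f"expected 'b' at position {pos}")
    return pos                             # the empty production: match nothing
def is_sentence(text):
    try:
        return parse_S(text) == len(text)
    except SyntaxError:
        return False
print(is_sentence("aaabbb"))  # True
print(is_sentence("aaabb"))   # False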

Context-free grammars are well understood and there are well-known ways to parse them, so they are widely used for describing the syntax of programming languages. But there's more. Sometimes a more general grammar is needed, when you have more things to count at the same time, independently. For example, when you want to describe a language where one can use round parentheses and square brackets interleaved, but they have to be paired up correctly with each other (brackets with brackets, parentheses with parentheses). This kind of grammar is called context-sensitive. You can recognize it by the fact that it has more than one symbol on the left-hand side (before the arrow). For example:

A R B --> A S B

You can read these additional symbols on the left as the "context" for applying the rule. There could be some preconditions, postconditions, and so on. For example, the rule above says that R can be replaced with S, but only when it is between A and B, leaving A and B themselves unchanged. This kind of syntax is really hard to parse, because it needs much more powerful machinery (a linear-bounded automaton in general, and a full-blown Turing machine for the unrestricted level-0 grammars). That is a whole other story, so I will end here.