Are lexers and parsers really that different in theory?

It seems fashionable to hate regular expressions: Coding Horror, another blog post.

Yet popular lexing-based tools such as pygments, GeSHi, and prettify all use regular expressions. They seem to be able to handle anything...

When is lexing enough, and when do you need EBNF?

Does anyone feed the tokens produced by these lexers into bison or antlr parser generators?


Current answer

There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases:

Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and white space as syntactic units would be considerably more complex than one that can assume comments and white space have already been removed by the lexical analyzer. If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.

Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.

Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

Source: Compilers: Principles, Techniques, and Tools (2nd edition), by Alfred V. Aho (Columbia University), Monica S. Lam (Stanford University), Ravi Sethi (Avaya), and Jeffrey D. Ullman (Stanford University).

Other answers

Yes, they are very different both in theory and in implementation.

Lexers are used to recognize the "words" that make up language elements, because the structure of such words is generally simple. Regular expressions are extremely good at handling this simpler structure, and there are very high-performance regular-expression matching engines used to implement lexers.

Parsers are used to recognize the "structure" of language phrases. Such structure is generally far beyond what "regular expressions" can recognize, so you need "context-sensitive" parsers to extract it. Context-sensitive parsers are hard to build, so the engineering compromise is to use "context-free" grammars and add hacks to the parsers ("symbol tables", etc.) to handle the context-sensitive part.

Neither lexing technology nor parsing technology is likely to go away any time soon.

They may be unified by deciding to use "parsing" technology to recognize "words", as is currently explored by so-called scannerless GLR parsers. That has a runtime cost, as you are applying more general machinery to what is often a problem that doesn't need it, and usually you pay for that in overhead. Where you have lots of free cycles, that overhead may not matter. If you process a lot of text, then the overhead does matter and classical regular expression parsers will continue to be used.

What parsers and lexers have in common:

- They read symbols of some alphabet from their input. (Hint: the alphabet doesn't necessarily have to consist of letters, but it has to consist of symbols which are atomic for the language understood by the parser/lexer.)
  - Symbols for the lexer: ASCII characters.
  - Symbols for the parser: the particular tokens, which are terminal symbols of its grammar.
- They analyse these symbols and try to match them with the grammar of the language they understand. Here's where the real difference usually lies. See below for more.
  - Grammar understood by lexers: regular grammar (Chomsky level 3).
  - Grammar understood by parsers: context-free grammar (Chomsky level 2).
- They attach semantics (meaning) to the language pieces they find.
  - Lexers attach meaning by classifying lexemes (strings of symbols from the input) as particular tokens. E.g. all these lexemes: *, ==, <=, ^ will be classified as an "operator" token by a C/C++ lexer.
  - Parsers attach meaning by classifying strings of tokens from the input (sentences) as particular nonterminals and by building the parse tree. E.g. all these token strings: [number][operator][number], [id][operator][id], [id][operator][number][operator][number] will be classified as the "expression" nonterminal by a C/C++ parser.
- They can attach some additional meaning (data) to the recognized elements.
  - When a lexer recognizes a character sequence constituting a proper number, it can convert it to its binary value and store it with the "number" token.
  - Similarly, when a parser recognizes an expression, it can compute its value and store it with the "expression" node of the syntax tree.
- They both produce on their output proper sentences of the language they recognize.
  - Lexers produce tokens, which are sentences of the regular language they recognize. Each token can have an inner syntax (though level 3, not level 2), but that doesn't matter for the output data or for whatever reads it.
  - Parsers produce syntax trees, which are representations of sentences of the context-free language they recognize. Usually it's only one big tree for the whole document/source file, because the whole document/source file is a proper sentence for them. But there isn't any reason why a parser couldn't produce a series of syntax trees on its output. E.g. it could be a parser which recognizes SGML tags stuck into plain text. So it would tokenize the SGML document into a series of tokens: [TXT][TAG][TAG][TXT][TAG][TXT]...

As you can see, parsers and tokenizers have much in common. One parser can be a tokenizer for another parser, which reads its input tokens as symbols from its own alphabet (tokens are simply symbols of some alphabet), in the same way that sentences from one language can be alphabetic symbols of some other, higher-level language. For example, if . and - are the symbols of the alphabet M (as "Morse code symbols"), then you can build a parser which recognizes strings of these dots and dashes as letters encoded in Morse code. The sentences in the language "Morse Code" could be tokens for some other parser, for which these tokens are atomic symbols of its language (e.g. the "English Words" language). And these "English Words" could be tokens (symbols of the alphabet) for some higher-level parser which understands the "English Sentences" language. All these languages differ only in the complexity of the grammar. Nothing more.
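To make the layering concrete, here is a rough sketch in Python (my own illustration, not from the original answer; the dictionary and helper names are made up). The output of the first stage (letters) becomes the input alphabet of the second stage (words):

MORSE = {".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E"}

def letters(code):
    # Stage 1: groups of dots/dashes separated by spaces -> letters.
    return [MORSE[group] for group in code.split(" ") if group]

def words(code):
    # Stage 2: letter sequences separated by "/" -> words.
    return ["".join(letters(part)) for part in code.split("/")]

print(words(".- -... / -.-. .- -..."))   # ['AB', 'CAB']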

So what are these "Chomsky levels" of grammars all about? Noam Chomsky classified grammars into four levels depending on their complexity:

Level 3: Regular grammars
They use regular expressions, that is, they can consist only of the symbols of the alphabet (a, b), their concatenations (ab, aba, bbb etc.), or alternatives (e.g. a|b). They can be implemented as finite state automata (FSA), like an NFA (Nondeterministic Finite Automaton) or, better, a DFA (Deterministic Finite Automaton). Regular grammars can't handle nested syntax, e.g. properly nested/matched parentheses (()()(()())), nested HTML/BBcode tags, nested blocks, etc., because a state automaton dealing with it would have to have infinitely many states to handle infinitely many nesting levels (see the sketch after this list).

Level 2: Context-free grammars
They can have nested, recursive, self-similar branches in their syntax trees, so they can handle nested structures well. They can be implemented as a state automaton with a stack. This stack is used to represent the nesting level of the syntax. In practice, they're usually implemented as top-down, recursive-descent parsers which use the machine's procedure call stack to track the nesting level, and use recursively called procedures/functions for every non-terminal symbol in their syntax. But they can't handle context-sensitive syntax. E.g. when you have an expression x+3, in one context this x could be the name of a variable, and in another context it could be the name of a function, etc.

Level 1: Context-sensitive grammars

Level 0: Unrestricted grammars
Also called recursively enumerable grammars.
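Here is a tiny sketch of that contrast (my own illustration, not from the answer): checking properly nested parentheses needs unbounded memory. A stack provides that, and for a single bracket kind a simple counter can play the role of the stack, but no fixed, finite set of states can.

def balanced(s):
    depth = 0                   # stands in for the stack; a counter suffices for one bracket kind
    for ch in s:
        if ch == "(":
            depth += 1          # "push"
        elif ch == ")":
            depth -= 1          # "pop"
            if depth < 0:       # a closing bracket with nothing open
                return False
    return depth == 0

print(balanced("(()())"))   # True
print(balanced("(()"))      # False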

Parsers will typically combine the tokens produced by the lexer and group them.

A parser is defined as something that analyzes the input to organize the data according to grammar rules, while a lexer transforms a sequence of characters into a sequence of symbols (tokens).

Let's look at the following example and imagine that we are trying to parse an addition.

437 + 734

The lexer scans the text and finds 4, 3, 7, and then a space ( ). The job of the lexer is to recognize that the characters 437 constitute one token of type NUM.

Then the lexer finds a + symbol, which corresponds to a second token of type PLUS, and finally it finds another token of type NUM.
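A minimal sketch of such a lexer in Python (my own illustration of the example above; the token names NUM and PLUS come from the text, everything else is made up):

import re

TOKEN_SPEC = [
    ("NUM",  r"\d+"),
    ("PLUS", r"\+"),
    ("SKIP", r"\s+"),   # whitespace is recognized but discarded
]

def lex(text):
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            match = re.match(pattern, text[pos:])
            if match:
                if name != "SKIP":
                    yield (name, match.group())
                pos += match.end()
                break
        else:
            raise SyntaxError("unexpected character: " + text[pos])

print(list(lex("437 + 734")))
# [('NUM', '437'), ('PLUS', '+'), ('NUM', '734')]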

See more details: A Guide to Parsing: Algorithms and Terminology


When is lexing enough, and when do you need EBNF?

EBNF really doesn't add much to the power of grammars. It is just a convenience / shortcut notation / "syntactic sugar" over the standard Chomsky Normal Form (CNF) grammar rules. For example, the EBNF alternative:

S --> A | B

can be achieved in CNF by just listing each alternative production separately:

S --> A      // `S` can be `A`,
S --> B      // or it can be `B`.

The optional element of EBNF:

S --> X?

can be obtained in CNF by using a nullable production, that is, one which can be replaced by an empty string (denoted here by just an empty production; others use epsilon, lambda, or a crossed circle):

S --> B       // `S` can be `B`,
B --> X       // and `B` can be just `X`,
B -->         // or it can be empty.

A production in a form like the last B above is called an "erasure", because it can erase whatever it stands for in other productions (producing an empty string instead of something else).

The zero-or-more repetition of EBNF:

S --> A*

can be obtained by using a recursive production, that is, one which embeds itself somewhere in it. It can be done in two ways. The first one is left recursion (which usually should be avoided, because top-down recursive-descent parsers cannot parse it):

S --> S A    // `S` is just itself ended with `A` (which can be done many times),
S -->        // or it can begin with empty-string, which stops the recursion.

Knowing that it generates just an empty string (ultimately) followed by zero or more A's, the same strings (but not the same grammar!) can also be expressed using right recursion:

S --> A S    // `S` can be `A` followed by itself (which can be done many times),
S -->        // or it can be just empty-string end, which stops the recursion.
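As a small aside (my own sketch, not part of the original answer), this is why the right-recursive form is friendly to top-down parsing: it consumes one A before recursing, so the recursion is bounded by the input, whereas a naive recursive-descent procedure for the left-recursive S --> S A would call itself at the same input position and never terminate.

def parse_S(s, i=0):
    # Right-recursive S --> A S | empty, where the terminal `A` is the character 'a'.
    if i < len(s) and s[i] == "a":
        return parse_S(s, i + 1)   # consume one `A`, then parse the rest as `S`
    return i                       # the empty alternative consumes nothing

print(parse_S("aaab"))   # 3 : matched three `A`s, then stopped
print(parse_S("b"))      # 0 : matched the empty alternative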

When it comes to the one-or-more repetition of EBNF:

S --> A+

it can be done by factoring out one A and using * as before:

S --> A A*

which you can express in CNF like this (I use right recursion here; try to figure out the other one yourself as an exercise):

S --> A S   // `S` can be one `A` followed by `S` (which stands for more `A`s),
S --> A     // or it could be just one single `A`.

Knowing this, you can now probably recognize a grammar for a regular expression (that is, a regular grammar) as one which can be expressed in a single EBNF production consisting only of terminal symbols. More generally, you can recognize a regular grammar when you see productions similar to these:

A -->        // Empty (nullable) production (AKA erasure).
B --> x      // Single terminal symbol.
C --> y D    // Simple state change from `C` to `D` when seeing input `y`.
E --> F z    // Simple state change from `E` to `F` when seeing input `z`.
G --> G u    // Left recursion.
H --> v H    // Right recursion.

That is, using only empty strings, terminal symbols, and simple non-terminals for substitutions and state changes, and using recursion only to achieve repetition (iteration, which is just linear recursion, the kind that doesn't branch like a tree). Nothing more advanced than these, and then you can be sure it is a regular syntax and you can go with just a lexer for that.
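For instance (a sketch of my own, not from the answer), the one-or-more grammar from above, S --> A S | A with A standing for the character a, collapses into a single regular expression that a lexer can use directly:

import re

one_or_more_a = re.compile(r"a+")   # the regex equivalent of S --> A S | A

print(bool(one_or_more_a.fullmatch("aaa")))   # True
print(bool(one_or_more_a.fullmatch("")))      # False: at least one `a` is required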

But when your syntax uses recursion in a non-trivial way, to produce tree-like, self-similar, nested structures, such as the following one:

S --> a S b    // `S` can be itself "parenthesized" by `a` and `b` on both sides.
S -->          // or it could be (ultimately) empty, which ends recursion.

then you can easily see that this cannot be done with a regular expression, because you cannot resolve it into one single EBNF production in any way; you'll end up substituting for S indefinitely, which will always add another a and b on both sides. Lexers (more specifically: the finite state automata used by lexers) cannot count to an arbitrary number (they are finite, remember?), so they don't know how many a's were there to match them evenly with that many b's. Grammars like this are called context-free grammars (at the very least), and they require a parser.
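Here is a minimal sketch of such a parser (my own illustration, assuming the terminals a and b are single characters) for the grammar above, S --> a S b | empty; the call stack does the counting that a finite automaton cannot:

def parse_S(s, i=0):
    # Try to match S at position i; return the position after the match, or None.
    if i < len(s) and s[i] == "a":
        j = parse_S(s, i + 1)                        # the nested `S`
        if j is not None and j < len(s) and s[j] == "b":
            return j + 1                             # the matching `b`
        return None
    return i                                         # the empty alternative

def accepts(s):
    return parse_S(s) == len(s)

print(accepts("aabb"))   # True
print(accepts("aab"))    # False: one `b` is missing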

Context-free grammars are well understood and efficient to parse, so they are widely used for describing programming languages' syntax. But there's more. Sometimes a more general grammar is needed, when you have more things to count at the same time, independently. For example, when you want to describe a language where one can use round parentheses and square brackets interleaved, but they have to be paired up correctly with each other (brackets with brackets, parentheses with parentheses). This kind of grammar is called context-sensitive. You can recognize it by the fact that it has more than one symbol on the left-hand side (before the arrow). For example:

A R B --> A S B

You can think of these additional symbols on the left as a "context" for applying the rule. There could be some preconditions, postconditions, etc. For example, the rule above will replace R with S, but only when it's between A and B, leaving A and B themselves unchanged. This kind of syntax is really hard to parse, because it needs a full-blown Turing machine. It's a whole different story, so I'll end here.