词法分析器与解析器

词法分析器和解析器在理论上真的有那么大的不同吗?

讨厌正则表达式似乎很时髦:编程恐怖，另一篇博客文章。

然而，流行的基于词法的工具:pygest、geshi或prettify都使用正则表达式。他们似乎什么都能做……

什么时候lexing是足够的，什么时候你需要EBNF?

有人在bison或antlr解析器生成器中使用这些词法分析器生成的标记吗?

当前回答

为什么编译器的分析部分是正常的，原因有很多分为词法分析和解析(语法分析)两个阶段。

Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and white space as syntactic units would be. Considerably more complex than one that can assume comments and white space have already been removed by the lexical analyzer. If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

resource___Compilers(第二版) 写的, 阿尔弗雷德·v·Abo 哥伦比亚大学林忆莲斯坦福大学拉维塞提亚美亚杰弗里·d·厄尔曼斯坦福大学

2014-03-10 17:40:27

其他回答

解析器和词法分析器的共同之处:

They read symbols of some alphabet from their input. Hint: The alphabet doesn't necessarily have to be of letters. But it has to be of symbols which are atomic for the language understood by parser/lexer. Symbols for the lexer: ASCII characters. Symbols for the parser: the particular tokens, which are terminal symbols of their grammar. They analyse these symbols and try to match them with the grammar of the language they understood. Here's where the real difference usually lies. See below for more. Grammar understood by lexers: regular grammar (Chomsky's level 3). Grammar understood by parsers: context-free grammar (Chomsky's level 2). They attach semantics (meaning) to the language pieces they find. Lexers attach meaning by classifying lexemes (strings of symbols from the input) as the particular tokens. E.g. All these lexemes: *, ==, <=, ^ will be classified as "operator" token by the C/C++ lexer. Parsers attach meaning by classifying strings of tokens from the input (sentences) as the particular nonterminals and building the parse tree. E.g. all these token strings: [number][operator][number], [id][operator][id], [id][operator][number][operator][number] will be classified as "expression" nonterminal by the C/C++ parser. They can attach some additional meaning (data) to the recognized elements. When a lexer recognizes a character sequence constituting a proper number, it can convert it to its binary value and store with the "number" token. Similarly, when a parser recognize an expression, it can compute its value and store with the "expression" node of the syntax tree. They all produce on their output a proper sentences of the language they recognize. Lexers produce tokens, which are sentences of the regular language they recognize. Each token can have an inner syntax (though level 3, not level 2), but that doesn't matter for the output data and for the one which reads them. Parsers produce syntax trees, which are representations of sentences of the context-free language they recognize. Usually it's only one big tree for the whole document/source file, because the whole document/source file is a proper sentence for them. But there aren't any reasons why parser couldn't produce a series of syntax trees on its output. E.g. it could be a parser which recognizes SGML tags sticked into plain-text. So it'll tokenize the SGML document into a series of tokens: [TXT][TAG][TAG][TXT][TAG][TXT]....

As you can see, parsers and tokenizers have much in common. One parser can be a tokenizer for other parser, which reads its input tokens as symbols from its own alphabet (tokens are simply symbols of some alphabet) in the same way as sentences from one language can be alphabetic symbols of some other, higher-level language. For example, if * and - are the symbols of the alphabet M (as "Morse code symbols"), then you can build a parser which recognizes strings of these dots and lines as letters encoded in the Morse code. The sentences in the language "Morse Code" could be tokens for some other parser, for which these tokens are atomic symbols of its language (e.g. "English Words" language). And these "English Words" could be tokens (symbols of the alphabet) for some higher-level parser which understands "English Sentences" language. And all these languages differ only in the complexity of the grammar. Nothing more.

那么，这些“乔姆斯基的语法水平”到底是怎么回事呢?诺姆·乔姆斯基根据语法的复杂程度将其分为四个等级:

Level 3: Regular grammars They use regular expressions, that is, they can consist only of the symbols of alphabet (a,b), their concatenations (ab,aba,bbb etd.), or alternatives (e.g. a|b).They can be implemented as finite state automata (FSA), like NFA (Nondeterministic Finite Automaton) or better DFA (Deterministic Finite Automaton).Regular grammars can't handle with nested syntax, e.g. properly nested/matched parentheses (()()(()())), nested HTML/BBcode tags, nested blocks etc. It's because state automata to deal with it should have to have infinitely many states to handle infinitely many nesting levels. Level 2: Context-free grammars They can have nested, recursive, self-similar branches in their syntax trees, so they can handle with nested structures well.They can be implemented as state automaton with stack. This stack is used to represent the nesting level of the syntax. In practice, they're usually implemented as a top-down, recursive-descent parser which uses machine's procedure call stack to track the nesting level, and use recursively called procedures/functions for every non-terminal symbol in their syntax.But they can't handle with a context-sensitive syntax. E.g. when you have an expression x+3 and in one context this x could be a name of a variable, and in other context it could be a name of a function etc. Level 1: Context-sensitive grammars Level 0: Unrestricted grammarsAlso called recursively enumerable grammars.

2010-09-01 03:53:27

是的，它们在理论上和执行上都有很大的不同。

词法分析器用于识别构成语言元素的“单词”，因为这些单词的结构通常很简单。正则表达式非常擅长处理这种更简单的结构，并且有非常高性能的正则表达式匹配引擎用于实现lexer。

解析器用于识别语言短语的“结构”。这样的结构通常远远超出了“正则表达式”所能识别的范围，因此需要 “上下文敏感”解析器来提取这样的结构。上下文敏感的解析器很难构建，所以工程上的妥协是使用“上下文无关”语法并在解析器(“符号表”等)中添加一些技巧来处理上下文敏感的部分。

词法分析和解析技术都不太可能很快消失。

They may be unified by deciding to use "parsing" technology to recognize "words", as is currently explored by so-called scannerless GLR parsers. That has a runtime cost, as you are applying more general machinery to what is often a problem that doesn't need it, and usually you pay for that in overhead. Where you have lots of free cycles, that overhead may not matter. If you process a lot of text, then the overhead does matter and classical regular expression parsers will continue to be used.

2010-05-17 20:52:45

回答所问的问题(不过分重复中出现的内容) 其他答案)

Lexers and parsers are not very different, as suggested by the accepted answer. Both are based on simple language formalisms: regular languages for lexers and, almost always, context-free (CF) languages for parsers. They both are associated with fairly simple computational models, the finite state automaton and the push-down stack automaton. Regular languages are a special case of context-free languages, so that lexers could be produced with the somewhat more complex CF technology. But it is not a good idea for at least two reasons.

编程的一个基本要点是系统组件应该采用最合适的技术，使之易于实现生产，理解和维护。技术不应该如此过度杀戮(使用比实际需要复杂和昂贵得多的技术)，它也不应该在其权力的极限，因此需要技术为了达到预期的目标而扭曲。

这就是为什么“讨厌正则表达式似乎是一种时尚”。虽然它们可以做很多事情，但有时需要非常难以阅读编码来实现它，更不用说各种扩展了而实施上的限制在一定程度上降低了它们的理论价值简单。lexer通常不这样做，通常是一个简单的，高效、合适的token解析技术。使用CF解析器虽然有可能，但象征性就太过分了。

不为词法分析器使用CF形式主义的另一个原因是，它可能会然后尝试使用全CF能力。但这可能会提高有关程序读取的结构问题。

Fundamentally, most of the structure of program text, from which meaning is extracted, is a tree structure. It expresses how the parse sentence (program) is generated from syntax rules. Semantics is derived by compositional techniques (homomorphism for the mathematically oriented) from the way syntax rules are composed to build the parse tree. Hence the tree structure is essential. The fact that tokens are identified with a regular set based lexer does not change the situation, because CF composed with regular still gives CF (I am speaking very loosely about regular transducers, that transform a stream of characters into a stream of token).

然而，CF与CF组成(通过CF传感器…不好意思数学)，不一定给CF，和可能使事情更多一般，但在实践中不太容易控制。所以CF不合适 lexers的工具，即使它可以使用。

常规和CF的主要区别之一是常规语言(和转换器)几乎可以与任何语言很好地组合形式主义在不同的方式，而CF语言(和传感器)做没有，甚至连他们自己也没有(除了少数例外)。

(注意，常规换能器可能有其他用途，如一些语法错误处理技术的形式化。)

BNF只是用于表示CF语法的特定语法。

EBNF是BNF的一个语法糖，使用了正则的功能以给出更简洁的BNF语法版本。总是可以的转化为等效的纯BNF。

然而，EBNF中通常使用常规符号只是为了强调这些与词汇结构相对应的语法部分元素，并且应该用lexer识别，而其余的用lexer识别宁可以直的BNF形式呈现。但这并不是绝对的规则。

综上所述，使用更简单的token结构可以更好地进行分析技术比较简单的常规语言，同时面向树语言的结构(程序语法)更好地由CF处理语法。

我建议你也看看AHR的答案。

但这留下了一个悬而未决的问题:为什么是树?

树是指定语法的良好基础，因为

它们使文章结构简单语义与文本的关联是非常方便的在这个结构的基础上，用数学方法很好地解释了被理解的技术(通过同态的组合)，如上面显示。它是一个基本的代数工具来定义数学形式主义的语义。

因此，它是一个很好的中间表示，如所示抽象语法树的成功。注意AST通常是不同于解析树是因为解析技术被很多人所采用专业人员(如LL或LR)只应用于CF的一个子集语法，从而导致后来的语法扭曲这可以通过更通用的解析来避免技术(基于动态编程)，它接受任何CF语法。

关于编程语言是上下文敏感(CS)而不是CF是任意的和有争议的。

问题是语法和语义的分离任意的。检查声明或类型协议可能被视为要么是语法，要么是语义。同样也适用于自然语言中的性别与数字一致性。但是有自然的复数是否一致取决于实际语义的语言单词的意义，使它不太符合语法。

在指称语义学中对编程语言的许多定义在语义中放置声明和类型检查。所以陈述为 Ira Baxter所做的CF解析器正在被黑客攻击以获得上下文语法所要求的灵敏度充其量是任意的视图的情况。在一些编译器中，它可能被组织成一个hack，但是它不必如此。

此外，不仅仅是CS解析器(在这里的其他答案中使用的意义上)很难构建，而且更少非常高效。它们也不足以明确地表达可能需要一些上下文敏感性。但他们没有自然地产生一种语法结构(如解析树) 是否方便推导程序的语义，即生成编译后的代码。

2014-06-11 14:19:30

为什么编译器的分析部分是正常的，原因有很多分为词法分析和解析(语法分析)两个阶段。

resource___Compilers(第二版) 写的, 阿尔弗雷德·v·Abo 哥伦比亚大学林忆莲斯坦福大学拉维塞提亚美亚杰弗里·d·厄尔曼斯坦福大学

2014-03-10 17:40:27

解析器通常会组合词法分析器生成的标记并对它们进行分组。

解析器定义为对输入进行分析以组织数据根据语法规则，lexer转换符号序列中的字符序列

让我们看看下面的例子，并想象我们正在尝试解析一个加法。

437 + 734

lexer扫描文本并找到4,3,7和一个空格()。词法分析器的工作是识别字符437构成一个类型为NUM的令牌。

然后lexer找到一个+符号，它对应于类型为PLUS的第二个令牌，最后它找到类型为NUM的另一个令牌。

查看更多细节: 解析指南:算法和术语

2022-04-25 15:23:17

词法分析器与解析器

推荐文章

最新文章

标签