用自己的语言编写编译器

直观地说，Foo语言的编译器本身似乎不能用Foo来编写。更具体地说，Foo语言的第一个编译器不能用Foo编写，但任何后续的编译器都可以为Foo编写。

但这是真的吗?我隐约记得读过一种语言，它的第一个编译器是用“自身”编写的。这可能吗?如果可能，如何实现?

当前回答

这就是所谓的“自我引导”。你必须首先用其他语言(通常是Java或C)为你的语言构建一个编译器(或解释器)。一旦完成了这一步，你就可以用语言Foo编写一个新版本的编译器。您使用第一个引导编译器来编译编译器，然后使用这个编译器来编译其他所有内容(包括自身的未来版本)。

大多数语言确实是以这种方式创建的，部分原因是语言设计者喜欢使用他们正在创建的语言，也因为非平凡的编译器通常可以作为语言“完整”程度的有用基准。

Scala就是一个例子。它的第一个编译器是用Martin Odersky的实验语言Pizza创建的。从2.0版开始，编译器完全用Scala重新编写。从那时起，旧的Pizza编译器可以完全被丢弃，因为新的Scala编译器可以用于将来的迭代中编译自己。

2008-10-11 01:34:48

其他回答

一般来说，你需要先让编译器工作(如果是原始的)，然后你才能开始考虑让它自托管。在某些语言中，这实际上被认为是一个重要的里程碑。

从我对“mono”的印象来看，他们很可能需要给反射添加一些东西来让它工作:mono团队一直指出，有些东西根本不可能用reflection . emit;当然，微软团队可能会证明他们错了。

这有一些真正的优点:对于初学者来说，这是一个相当好的单元测试!你只需要担心一种语言(也就是说，c#专家可能不太懂c++;但是现在你可以修复c#编译器)。但我想知道，这里是否存在某种职业自豪感:他们只是希望它能够自我托管。

不完全是编译器，但我最近一直在研究一个自托管系统;代码生成器用于生成代码生成器…所以如果模式改变了，我简单地运行它本身:新版本。如果有错误，我就返回到早期版本并重试。非常方便，也很容易维护。

更新1

我刚刚看了Anders在PDC的视频，(大约一个小时后)他给出了一些更合理的理由——都是关于编译器即服务的。只是为了记录。

2008-11-02 09:00:08

我自己编写了SLIC(实现编译器的语言系统)。然后手工编译成汇编。有很多关于SLIC的东西，因为它是一个由五个子语言组成的单一编译器:

语法分析器编程语言PPL GENERATOR基于lisp2的树爬伪代码生成语言 ISO In Sequence, PSEUDO代码，优化语言伪宏类汇编代码生成语言。汇编机指令定义语言。

SLIC的灵感来自CWIC(用于编写和实现编译器的编译器)。与大多数编译器开发包不同，SLIC和CWIC使用专门的、领域特定的语言来处理代码生成。SLIC扩展了CWICs代码生成，添加了ISO、PSEUDO和MACHOP子语言，将目标机器细节从树爬行生成器语言中分离出来。

lisp2树和列表

基于lisp2的生成器语言的动态内存管理系统是一个关键组件。列表用方括号括起来的语言表示，其组成部分用逗号分隔，即一个包含三个元素的列表[a,b,c]。

树:

     ADD
    /   \
  MPY     3
 /   \
5     x

由首项为节点对象的列表表示:

[ADD,[MPY,5,x],3]

树通常在分支之前单独显示节点:

ADD[MPY[5,x],3]

使用基于lisp2的生成器函数进行解析

生成器函数是一组(unparse)=>动作>对…

<NAME>(<unparse>)=><action>;
      (<unparse>)=><action>;
            ...
      (<unparse>)=><action>;

Unparse expressions are tests that match tree patterns and/or object types breaking them apart and assigning those parts to local variable to be processed by its procedural action. Kind of like an overloaded function taking different argument types. Except the ()=> ... tests are attempted in the order coded. The first successful unparse executing its corresponding action. The unparse expressions are disassembling tests. ADD[x,y] matches a two branch ADD tree assigning its branches to local variables x and y. The action may be a simple expression or a .BEGIN ... .END bounded code block. I would use c style { ... } blocks today. Tree matching, [], unparse rules may call generators passing the returned result(s) to the action:

expr_gen(ADD[expr_gen(x),expr_gen(y)])=> x+y;

Specifically the above expr_gen unparse matches a two branch ADD tree. Within the test pattern a single argument generator placed in a tree branch will be called with that branch. Its argument list though are local variables assigned returned objects. Above the unparse specifies a two branch is ADD tree disassembly, recursive pressing each branch to expr_gen. The left branch return placed into into local variables x. Likewise the right branch passed to expr_gen with y the return object. The above could be part of a numeric expression evaluator. There were shortcut features called vectors were in the above instead of the node string a vector of nodes could be used with a vector of corresponding actions:

expr_gen(#node[expr_gen(x),expr_gen(y)])=> #action;

  node:   ADD, SUB, MPY, DIV;
  action: x+y, x-y, x*y, x/y;

        (NUMBER(x))=> x;
        (SYMBOL(x))=> val:(x);

上面更完整的表达式求值器将expr_gen左分支返回值赋给x，右分支返回值赋给y。对应的操作向量在x和y上执行返回值。最后一个unparse=>操作对匹配数字和符号对象。

符号和符号属性

符号可以有命名属性。val:(x)访问x中包含的符号对象的val属性。通用符号表堆栈是SLIC的一部分。符号表可以被推和弹出，为函数提供局部符号。新创建的符号被编目在顶部的符号表中。符号查找首先从顶表向后向下查找符号表堆栈。

生成机器独立代码

SLIC's generator language produces PSEUDO instruction objects, appending them to a sections code list. A .FLUSH causes its PSEUDO code list to be run removing each PSEUDO instruction from the list and calling it. After execution a PSEUDO objects memory is released. The procedural bodies of PSEUDOs and GENERATOR actions are basically the same language except for their output. PSEUDO are meant to act as assembly macros providing machine independent code sequentialization. They provide a separation of the specific target machine out of the tree crawling generator language. PSEUDOs call MACHOP functions to output machine code. MACHOPs are used to define assembly pseudo ops (like dc, define constant etc) and machine instruction or a family of like formated instructions using vectored entry. They simply transform their parameters into a sequence of bit fields making up the instruction. MACHOP calls are meant to look like assembly and provide print formatting of the fields for when assembly is shown in the compile listing. In the example code I am using c style commenting that could be easily added but was not in the original languages. MACHOPs are producing code into a bit addressable memory. The SLIC linker handles output of the compiler. A MACHOP for the DEC-10 user mode instructions using vectored entry:

.MACHOP #opnm register,@indirect offset (index): // Instruction's parameters.
.MORG 36, O(18): $/36; // Align to 36 bit boundary print format: 18 bit octal $/36
O(9):  #opcd;          // Op code 9 bit octal print out
 (4):  register;       // 4 bit register field appended print
 (1):  indirect;       // 1 bit appended print
 (4):  index;          // 4 bit index register appended print
O(18): if (#opcd&&3==1) offset // immediate mode use value else
       else offset/36;         // memory address divide by 36
                               // to get word address.
// Vectored entry opcode table:
#opnm := MOVE, MOVEI, MOVEM, MOVES, MOVS, MOVSI, MOVSM, MOVSS,
         MOVN, MOVNI, MOVNM, MOVNS, MOVM, MOVMI, MOVMM, MOVMS,
         IMUL, IMULI, IMULM, IMULB, MUL,  MULI,  MULM,  MULB,
                           ...
         TDO,  TSO,   TDOE,  TSOE,  TDOA, TSOA,  TDON,  TSON;
// corresponding opcode value:
#opcd := 0O200, 0O201, 0O202, 0O203, 0O204, 0O205, 0O206, 0O207,
         0O210, 0O211, 0O212, 0O213, 0O214, 0O215, 0O216, 0O217,
         0O220, 0O221, 0O222, 0O223, 0O224, 0O225, 0O226, 0O227,
                           ...
         0O670, 0O671, 0O672, 0O673, 0O674, 0O675, 0O676, 0O677;

The .MORG 36, O(18): $/36;将位置对齐到36位边界，打印八进制的18位位置$/36字地址。9位opcd,4位寄存器，间接位和4位索引寄存器被组合并打印，就像单个18位字段一样。18位地址/36或立即值以八进制输出并打印。一个MOVEI的例子打印r1 = 1和r2=2:

400020 201082 000005            MOVEI r1,5(r2)

使用编译器程序集选项，可以在编译清单中获得生成的程序集代码。

连接在一起

SLIC链接器是作为处理链接和符号解析的库提供的。目标特定的输出加载文件格式必须为目标机器编写，并与链接器库库链接。

生成器语言能够将树写入文件并读取它们，从而实现多通道编译器。

代码生成和起源的简短总结

I have went over the code generation first to insure it is understood that SLIC was a true compiler compiler. SLIC was inspired by CWIC (Compiler for Writing and Implementing Compilers) developed at Systems Development Corporation in the late 1960s. CWIC only had SYNTAX and GENERATOR languages producing numeric byte code out of the GENERATOR language. Byte code was placed or planted (the term used in CWICs documentation) into memory buffers associated with named sections and written out by a .FLUSH statement. An ACM paper on CWIC is available from the ACM archives.

成功地实现了一种主要的编程语言

在20世纪70年代后期，SLIC被用来编写COBOL交叉编译器。主要由一名程序员在3个月内完成。根据需要，我和程序员一起工作了一会儿。另一个程序员为目标TI-990迷你计算机编写了运行时库和MACHOPs。这个COBOL编译器每秒编译的行数比用汇编编写的12 -10原生COBOL编译器多得多。

更多关于编译器的内容

从头编写编译器的一个重要部分是运行时库。你需要一个符号表。你需要输入和输出。动态内存管理等。为编译器编写运行时库比编写编译器更容易。但是在SLIC中，运行时库对于所有在SLIC中开发的编译器都是通用的。注意有两个运行时库。一个用于该语言的目标机器(例如COBOL)。另一个是编译器运行时库。

我认为我已经确定这些不是解析器生成器。因此，现在只要稍微了解一下后端，我就可以解释解析器编程语言了。

解析器编程语言

解析器是使用简单方程形式的公式编写的。

<name> <formula type operator> <expression> ;

The language element at the lowest level is the character. Tokens are formed from a subsets of the characters of the language. Character classes are used to name and define those character subsets. The character class defining operator is the colon (:) character. Characters that are members of the class are coded on the right side of the definition. Printable characters are enclosed in primes single ' strings. Nonprinting and special characters may be represented by their numeric ordinal. Class members are separated by an alternative | operator. A class formula ends with a semicolon. Character classes may include previously defined classes:

/*  Character Class Formula                                    class_mask */
bin: '0'|'1';                                                // 0b00000010
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';                            // 0b00000110
dgt: oct|'8'|'9';                                            // 0b00001110
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'|'a'|'b'|'c'|'d'|'e'|'f';    // 0b00011110
upr:  'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
      'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';   // 0b00100000
lwr:  'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
      'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';   // 0b01000000
alpha:  upr|lwr;                                             // 0b01100000
alphanum: alpha|dgt;                                         // 0b01101110

skip_class 0b00000001是预定义的，但可能超出了定义skip_class的范围。

In summary: A character class is a list of alternative that can only be a character constant, a character's ordinal, or a previously defined character class. As I implemented character classes: The class formula is assigned a class bit mask. (Shown in comments above) Any class formula having any character literal or ordinal causes a class bit to be allocated. A mask is made by oring the included class(es)'s class mask(s) together with the allocated bit (if any). A class table is created from the character classes. An entry indexed by a character's ordinal contains bits indicating the character's class memberships. Class testing is done inline. An IA-86 code example with the character's ordinal in eax illustrates class testing:

test    byte ptr [eax+_classmap],dgt

后面跟着a:

jne      <success>

je       <failure>

之所以使用IA-86指令代码示例，是因为我认为IA-86指令在今天更加广为人知。对类掩码求值的类名与按序号(在eax中)索引的类表进行非破坏性的and运算。非零结果表示类成员。(EAX除包含该字符的al(EAX的低8位)外均为零)。

Tokens were a bit different in these old compilers. Key words were not explained as tokens. They simply were matched by quoted string constants in the parser language. Quoted strings are not normally kept. Modifiers may be used. A + keeps the string matched. (i.e. +'-' matches a - character keeping the character when successful) The , operation (i.e. ,'E') inserts the string into the token. White space is handled by the token formula skipping leading SKIP_CLASS characters until a first match is made. Note that an explicit skip_class character match will stop the skipping allowing a token to start with a skip_class character. The string token formula skips leading skip_class characters matching a single quote quitedd character or a double quoted string. Of interest is the matching a " character within a " quoted string:

string .. (''' .ANY ''' | '"' $(-"""" .ANY | """""","""") '"') MAKSTR[];

The first alternative matches any single quote quoted character. The right alternative matches a double quote quoted string that may include double quote characters using two " character together to represent a single " character. This formula defines the strings used in its own definition. The inner right alternative '"' $(-"""" .ANY | """""","""") '"' matches a double quote quoted string. We can use a single ' quoted character to match a double quote " character. However within the double " quoted string if we wish to use a " character we must use two " characters to get one. For example in the inner left alternative matching any character except a quote:

-"""" .ANY

一个消极的提前窥视-""""被使用，当成功(不匹配“字符)然后匹配。any字符(不能是”字符，因为-""""消除了这种可能性)。正确的选择是接受-""""匹配“字符和失败是正确的选择:

"""""",""""

tries to match two " characters replacing them with a single double " using ,"""" to inserting thw single " character. Both inner alternatives failing the closing string quote character is matched and MAKSTR[] called to create a string object. The $ sequence, loop while successful, operator is used in matching a sequence. Token formula skip leading skip class characters(whit space). Once a first match is made skip_class skipping is disabled. We can call functions programed in other languages using []. MAKSTR[], MAKBIN[], MAKOCT[], MAKHEX[], MAKFLOAT[], and MAKINT[] are supplied library function that convert a matched token string to a typed object. The number formula below illustrates a fairly complex token recognition:

number .. "0B" bin $bin MAKBIN[]        // binary integer
         |"0O" oct $oct MAKOCT[]        // octal integer
         |("0H"|"0X") hex $hex MAKHEX[] // hexadecimal integer
// look for decimal number determining if integer or floating point.
         | ('+'|+'-'|--)                // only - matters
           dgt $dgt                     // integer part
           ( +'.' $dgt                  // fractional part?
              ((+'E'|'e','E')           // exponent  part
               ('+'|+'-'|--)            // Only negative matters
               dgt(dgt(dgt|--)|--)|--)  // 1 2 or 3 digit exponent
             MAKFLOAT[] )               // floating point
           MAKINT[];                    // decimal integer

上面的数字令牌公式可以识别整数和浮点数。替代方案总是成功的。数值对象可用于计算。公式成功时，令牌对象被推入解析堆栈。(+'E'|' E'，'E')中的指数前导很有趣。我们希望MAKEFLOAT[]始终有一个大写的E。但是我们允许用小写的e来代替它。

You may have noticed consistencies of character class and token formula. The parsing formula continue that adding backtracking alternatives and tree construction operators. Backtracking and non-backtracking alternative operators may not be mixed within an expression level. You may not have (a | b \ c) mixing non-backtracking | withe \ backtracking alternative. (a\b\c), (a|b|c) and ((a|b)\c) are valid. A \ backtracking alternative saves the parse state before attempting its left alternative and on failure restores the parse state before attempting the right alternative. In a sequence of alternatives the first successful alternative satisfies the group. Further alternatives are not attempted. Factoring and grouping provides for a continuous advancing parse. The backtrack alternative creates a saved state of the parse before it attempts its left alternative. Backtracking is required when the parse may make a partial match and then fail:

(a b | c d)\ e

In the above if a returns failure the alternative c d is attempted. If then c returns failure the backtrack alternative will be attempted. If a succeeds and b fails the parse wile be backtracked and e attempted. Likewise a failing c successful and b fails the parse is backtracked and the alternative e taken. Backtracking is not limited to within a formula. If any parsing formula makes a partial match at any time and then fails the parse is reset to the top backtrack and its alternative taken. A compile failure can occur if code has been output sense the backtrack was created. A backtrack is set before starting the compile. Returning failure or backtracking to it is a compiler failure. Backtracks are stacked. We may use negative - and positive ? peek/look ahead operators to test without advancing the parse. being string test is a peek ahead only needing the input state saved and reset. A look ahead would be a parsing expression that makes a partial match before failing. A look ahead is implemented using backtracking.

解析器语言既不是LL解析器，也不是LR解析器。而是一种编写递归解析器的编程语言，你可以在其中编写树结构:

:<node name> creates a node object and pushes it onto the node stack.
..           Token formula create token objects and push them onto 
             the parse stack.
!<number>    pops the top node object and top <number> of parstack 
             entries into a list representation of the tree. The 
             tree then pushed onto the parse stack.
+[ ... ]+    creates a list of the parse stack entries created 
             between them:
              '(' +[argument $(',' argument]+ ')'
             could parse an argument list. into a list.

一个常用的解析示例是一个算术表达式:

Exp = Term $(('+':ADD|'-':SUB) Term!2); 
Term = Factor $(('*':MPY|'/':DIV) Factor!2);
Factor = ( number
         | id  ( '(' +[Exp $(',' Exp)]+ ')' :FUN!2
               | --)
         | '(' Exp ')" )
         (^' Factor:XPO!2 |--);

Exp和Term使用循环创建左手树。使用右递归的因子创建一个右手树:

d^(x+5)^3-a+b*c => ADD[SUB[EXP[EXP[d,ADD[x,5]],3],a],MPY[b,c]]

              ADD
             /   \
          SUB     MPY
         /   \   /   \
      EXP     a b     c
     /   \
    d     EXP     
         /   \
      ADD     3
     /   \
    x     5

这里是一个cc编译器的片段，一个带有c风格注释的SLIC的更新版本。函数类型(语法、令牌、字符类、生成器、PSEUDO或MACHOP)由其id后面的初始语法决定。使用这些自顶向下的解析器，你可以从一个程序定义公式开始:

program = $((declaration            // A program is a sequence of
                                    // declarations terminated by
            |.EOF .STOP)            // End Of File finish & stop compile
           \                        // Backtrack: .EOF failed or
                                    // declaration long-failed.
             (ERRORX["?Error?"]     // report unknown error
                                    // flagging furthest parse point.
              $(-';' (.ANY          // find a ';'. skiping .ANY
                     | .STOP))      // character: .ANY fails on end of file
                                    // so .STOP ends the compile.
                                    // (-';') failing breaks loop.
              ';'));                // Match ';' and continue

declaration =  "#" directive                // Compiler directive.
             | comment                      // skips comment text
             | global        DECLAR[*1]     // Global linkage
             |(id                           // functions starting with an id:
                ( formula    PARSER[*1]     // Parsing formula
                | sequencer  GENERATOR[*1]  // Code generator
                | optimizer  ISO[*1]        // Optimizer
                | pseudo_op  PRODUCTION[*1] // Pseudo instruction
                | emitor_op  MACHOP[*1]     // Machine instruction
                )        // All the above start with an identifier
              \ (ERRORX["Syntax error."]
                 garbol);                    // skip over error.

//注意id在创建树时是如何被分解和组合的。

formula =   ("==" syntax  :BCKTRAK   // backtrack grammar formula
            |'='  syntax  :SYNTAX    // grammar formula.
            |':'  chclass :CLASS     // character class define
            |".." token   :TOKEN     // token formula
              )';' !2                // Combine node name with id 
                                     // parsed in calling declaration 
                                     // formula and tree produced
                                     // by the called syntax, token
                                     // or character class formula.
                $(-(.NL |"/*") (.ANY|.STOP)); Comment ; to line separator?

chclass = +[ letter $('|' letter) ]+;// a simple list of character codes
                                     // except 
letter  = char | number | id;        // when including another class

syntax  = seq ('|' alt1|'\' alt2 |--);

alt1    = seq:ALT!2 ('|' alt1|--);  Non-backtrack alternative sequence.

alt2    = seq:BKTK!2 ('\' alt2|--); backtrack alternative sequence

seq     = +[oper $oper]+;

oper    = test | action | '(' syntax ')' | comment; 

test    = string | id ('[' (arg_list| ,NILL) ']':GENCALL!2|.EMPTY);

action  = ':' id:NODE!1
        | '!' number:MAKTREE!1
        | "+["  seq "]+" :MAKLST!1;

//     C style comments
comment  = "//" $(-.NL .ANY)
         | "/*" $(-"*/" .ANY) "*/";

值得注意的是解析器语言如何处理注释和错误恢复。

我想我已经回答了这个问题。SLICs后继写了很大一部分，cc语言本身就在这里。目前还没有针对它的编译器。但我可以手动将其编译成汇编代码，裸asm c或c++函数。

2019-12-20 23:49:58

这是一个dump(实际上很难搜索的话题):

闲聊 C

这也是PyPy和Rubinius的想法:

(我想这也适用于Forth，但我对Forth一无所知。)

2008-10-11 01:42:12

我记得我听过一个软件工程广播播客，其中Dick Gabriel谈到了如何在纸上用LISP编写一个最简单的版本，然后手工将其组装成机器代码，从而引导最初的LISP解释器。从那时起，其余的LISP特性都是用LISP编写和解释的。

2008-10-13 18:42:57

Mono项目的c#编译器已经“自托管”很长一段时间了，这意味着它是用c#本身编写的。

据我所知，编译器最初是纯C代码，但一旦实现了ECMA的“基本”特性，他们就开始用c#重写编译器。