How does the compilation and linking process work?

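(Note: This is meant to be an entry to Stack Overflow's C++ FAQ. If you want to critique the idea of providing an FAQ in this form, then the posting on meta that started all this would be the place to do that. Answers to this question are monitored in the C++ chatroom, where the FAQ idea started out in the first place, so your answer is very likely to get read by those who came up with the idea.)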


Current answer

The skinny is that a CPU loads data from memory addresses, stores data to memory addresses, and executes instructions sequentially out of memory addresses, with some conditional jumps in the sequence of instructions processed. Each of these three categories of instructions involves computing the address of a memory cell to be used in the machine instruction. Because machine instructions are of variable length depending on the particular instruction involved, and because we string a variable number of them together as we build our machine code, there is a two-step process involved in calculating and building any addresses.

First we lay out the allocation of memory as best we can, before we can know exactly what goes in each cell. We figure out the bytes, or words, or whatever else forms the instructions, literals, and data. We just start allocating memory and building the values that will make up the program as we go, and note down any place where we need to go back and fix an address. In that place we put a dummy value to pad the location so that we can continue to calculate memory size. For example, our first machine instruction might take one cell. The next might take 3 cells: one opcode cell and two address cells. Now our address pointer is 4. We know what goes in the opcode cell, but we have to wait to calculate what goes in the address cells until we know where that data will be located, i.e. what the machine address of that data will be.
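
The layout-then-patch idea above can be sketched in a few lines of C++. This is only an illustration under assumptions of my own: the `Instr` struct, the `assemble` function, the opcode values, and the choice of splitting an address into a high and a low cell are all made up for the example and do not describe any real instruction set.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One hypothetical instruction: an opcode cell, optionally followed by
// two address cells that refer to a labelled location.
struct Instr {
    std::string label;   // optional name for the address of this instruction
    int opcode;          // value stored in the opcode cell
    std::string target;  // label this instruction refers to, or empty
};

std::vector<int> assemble(const std::vector<Instr>& program) {
    std::vector<int> cells;
    std::map<std::string, std::size_t> label_address;
    std::vector<std::pair<std::size_t, std::string> > fixups;

    // Step 1: lay out memory, padding unknown addresses with dummy cells
    // and noting down every place that has to be fixed later.
    for (const Instr& in : program) {
        if (!in.label.empty()) label_address[in.label] = cells.size();
        cells.push_back(in.opcode);
        if (!in.target.empty()) {
            fixups.push_back(std::make_pair(cells.size(), in.target));
            cells.push_back(0);  // dummy for the high address cell
            cells.push_back(0);  // dummy for the low address cell
        }
    }

    // Step 2: every label now has an address, so patch the dummies.
    // at() throws std::out_of_range if a target label was never defined.
    for (const auto& fix : fixups) {
        const std::size_t address = label_address.at(fix.second);
        cells[fix.first] = static_cast<int>(address / 256);
        cells[fix.first + 1] = static_cast<int>(address % 256);
    }
    return cells;
}
```

With a one-cell instruction, then a three-cell instruction that refers to a label `data`, then the labelled cell itself, the label ends up at address 4, matching the address pointer in the walk-through above.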

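If there is just one source file, a compiler can theoretically produce fully executable machine code without a linker. In a two-pass process it could calculate the actual addresses of all the data cells referenced by any machine load or store instructions. It could also calculate all of the absolute addresses referenced by any absolute jump instructions. This is how simpler compilers work, like the one in Forth, with no linker.

A linker is something that allows blocks of code to be compiled separately. This can speed up the overall process of building code, and it allows some flexibility in how the blocks are used later; in other words, they can be relocated in memory, for example by adding 1000 to every address to scoot the block up by 1000 address cells.

So what the compiler outputs is rough machine code that is not yet fully built, but is laid out so that we know the size of everything, in other words so that we can start to calculate where all of the absolute addresses will be located. The compiler also outputs a list of symbols, which are name/address pairs. The symbols relate a memory offset in the machine code of the module to a name. The offset is the distance from the start of the module to the memory location of the symbol.

That's where we get to the linker. The linker first slaps all of these blocks of machine code together end to end and notes down where each one starts. Then it calculates the addresses to be fixed by adding together the relative offset within a module and the absolute position of the module in the bigger layout.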
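
As a rough sketch of that last step, assuming each compiled block is described only by its size and its name/offset pairs, the address calculation could look like the following. `Module`, `link_addresses`, and the `load_address` parameter are names I made up for the illustration; a real linker does considerably more.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A separately compiled block: its size in address cells plus the
// name/offset pairs the compiler emitted for it.
struct Module {
    std::size_t size;
    std::map<std::string, std::size_t> symbols;  // offsets within the module
};

// Place the modules end to end, note where each one starts, and turn
// every module-relative offset into an absolute address.
std::map<std::string, std::size_t> link_addresses(
        const std::vector<Module>& modules, std::size_t load_address) {
    std::map<std::string, std::size_t> absolute;
    std::size_t start = load_address;  // where the current module begins
    for (const Module& m : modules) {
        for (const auto& sym : m.symbols) {
            absolute[sym.first] = start + sym.second;
        }
        start += m.size;  // the next module begins right after this one
    }
    return absolute;
}
```

Relocating the whole program is then just a matter of passing a different `load_address`: every resulting address moves by the same amount.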

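Obviously I've oversimplified this so you can try to grasp it, and I have deliberately not used the jargon of object files, symbol tables, etc., which to me is part of the confusion.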

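Other answers

On the standard front:

A translation unit is the combination of a source file, its included headers, and any included source files, less any source lines skipped by conditional inclusion preprocessor directives.

The standard defines 9 phases of translation. The first four correspond to preprocessing, the next three are the compilation, the next one is the instantiation of templates (producing instantiation units), and the last one is the linking.

In practice the eighth phase (the instantiation of templates) is often done during the compilation process, but some compilers delay it to the linking phase and some spread it across the two.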

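The compilation of a C++ program involves three steps:

1. Preprocessing: the preprocessor takes a C++ source code file and deals with the #includes, #defines and other preprocessor directives. The output of this step is a "pure" C++ file without preprocessor directives.
2. Compilation: the compiler takes the preprocessor's output and produces an object file from it.
3. Linking: the linker takes the object files produced by the compiler and produces either a library or an executable file.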

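Preprocessing

The preprocessor handles the preprocessor directives, like #include and #define. It is agnostic of the syntax of C++, which is why it must be used with care.

It works on one C++ source file at a time by replacing #include directives with the content of the respective files (which is usually just declarations), doing replacement of macros (#define), and selecting different portions of text depending on #if, #ifdef and #ifndef directives.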
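
A small sketch of macro replacement and conditional selection, and of why the preprocessor's ignorance of C++ syntax calls for care. All macro and function names here are invented for the example.

```cpp
#include <string>

#define BUFFER_SIZE 64            // object-like macro: plain text substitution
#define SQUARE(x) ((x) * (x))     // function-like macro with protective parentheses
#define BAD_SQUARE(x) x * x       // same idea, but without the parentheses

// Exactly one of these two definitions survives preprocessing.
#ifdef _WIN32
std::string platform_name() { return "windows"; }
#else
std::string platform_name() { return "not windows"; }
#endif

int buffer_area() { return SQUARE(BUFFER_SIZE); }  // ((64) * (64))

// The preprocessor substitutes text without knowing about operator
// precedence: BAD_SQUARE(1 + 2) becomes 1 + 2 * 1 + 2, which is 5.
int careless() { return BAD_SQUARE(1 + 2); }
int careful() { return SQUARE(1 + 2); }            // ((1 + 2) * (1 + 2)) is 9
```

The compiler proper never sees `SQUARE` or `BUFFER_SIZE`; by the time it runs, only the substituted text is left.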

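The preprocessor works on a stream of preprocessing tokens. Macro substitution is defined as replacing tokens with other tokens (the operator ## enables merging two tokens when it makes sense).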
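
For instance, ## can glue a fixed prefix onto a macro argument to form a new identifier. The macro `MAKE_GETTER` and the variables below are hypothetical names chosen for this sketch.

```cpp
// Pastes the tokens `get_` and the argument into one identifier, so
// MAKE_GETTER(width) defines a function named get_width.
#define MAKE_GETTER(field) int get_##field() { return field; }

static int width = 3;
static int height = 4;

MAKE_GETTER(width)   // expands to: int get_width() { return width; }
MAKE_GETTER(height)  // expands to: int get_height() { return height; }
```

The merge only makes sense when the result is itself a valid token, as `get_width` is here.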

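After all this, the preprocessor produces a single output that is a stream of tokens resulting from the transformations described above. It also adds some special markers that tell the compiler where each line came from, so that it can use them to produce sensible error messages.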
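
The exact form of those markers is compiler-specific, but the standard #line directive has a comparable effect and makes the idea visible from within a program. The file name and function names below are made up for the illustration.

```cpp
#include <string>

// From the next line on, the compiler behaves as if it were reading
// line 100 of a file called "generated_example.cpp".
#line 100 "generated_example.cpp"
int line_after_directive() { return __LINE__; }
std::string file_after_directive() { return __FILE__; }
```

Any diagnostic issued for code after the directive would be reported against that file name and numbering, which is how an error can point back into a header even though the compiler only ever sees one preprocessed stream.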

Some errors can be produced at this stage with clever use of the #if and #error directives.
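
A minimal sketch of that technique: the checks below stop the build during preprocessing if an assumption of the code does not hold. The particular conditions and the `bits_in` function are just examples I picked.

```cpp
#include <climits>

// Refuse to build on platforms this code was not written for.
#if CHAR_BIT != 8
#error "This code assumes 8-bit bytes"
#endif

#if !defined(__cplusplus)
#error "This file must be compiled as C++"
#endif

// Only reached when both checks above passed.
int bits_in(int bytes) { return bytes * CHAR_BIT; }
```

On a typical desktop toolchain neither #error fires and the file compiles as usual.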

Compilation

The compilation step is performed on each output of the preprocessor. The compiler parses the pure C++ source code (now without any preprocessor directives) and converts it into assembly code. It then invokes the underlying back end (the assembler in the toolchain), which assembles that code into machine code and produces an actual binary file in some format (ELF, COFF, a.out, ...). This object file contains the compiled code (in binary form) of the symbols defined in the input. Symbols in object files are referred to by name.

Object files can refer to symbols that are not defined. This is the case when you use a declaration and don't provide a definition for it. The compiler doesn't mind this, and will happily produce the object file as long as the source code is well-formed.
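
To make this concrete, here is a sketch with invented function names. The first two declarations are all the compiler needs to translate `twice_helper`; the definition at the bottom would normally sit in a different source file and is included here only so the example can be built on its own.

```cpp
// Declaration only: enough for the compiler to type-check calls.
int helper(int x);

// Compiling just this function yields object code that refers to the
// symbol `helper` by name without knowing its address.
int twice_helper(int x) { return 2 * helper(x); }

// Assumption for this sketch: in a real project this definition would
// typically live in another source file and be found by the linker.
int helper(int x) { return x + 1; }
```

If the last definition were removed and nothing else supplied it, compilation would still succeed; only the link step would complain.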

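Compilers usually let you stop compilation at this point. This is very useful because it lets you compile each source code file separately. The advantage is that you don't need to recompile everything if you only change a single file.

The produced object files can be put in special archives called static libraries, for easier reuse later on.

It's at this stage that "regular" compiler errors, like syntax errors or failed overload resolution errors, are reported.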

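Linking

The linker is what produces the final compilation output from the object files the compiler produced. This output can be either a shared (or dynamic) library (and while the name is similar, these haven't got much in common with the static libraries mentioned earlier) or an executable.

It links all the object files by replacing the references to undefined symbols with the correct addresses. Each of these symbols can be defined in other object files or in libraries. If they are defined in libraries other than the standard library, you need to tell the linker about them.

At this stage the most common errors are missing definitions or duplicate definitions. The former means that either the definitions don't exist (i.e. they were never written), or that the object files or libraries where they reside were not given to the linker. The latter is obvious: the same symbol was defined in two different object files or libraries.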
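
Both kinds of error fall out of a simple bookkeeping exercise over symbol names. The following is a toy model under my own assumptions (`ObjectFile`, `LinkReport`, and `check_symbols` are invented, and real linkers also deal with libraries, weak symbols, and more), but it shows where the two diagnostics come from.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// A toy view of an object file: the symbols it defines and the symbols
// it refers to.
struct ObjectFile {
    std::string name;
    std::set<std::string> defined;
    std::set<std::string> referenced;
};

struct LinkReport {
    std::set<std::string> undefined;   // referenced but defined nowhere
    std::set<std::string> duplicated;  // defined in more than one object
};

LinkReport check_symbols(const std::vector<ObjectFile>& objects) {
    // Count how many objects define each symbol.
    std::map<std::string, int> definition_count;
    for (const ObjectFile& obj : objects)
        for (const std::string& sym : obj.defined)
            ++definition_count[sym];

    LinkReport report;
    for (const auto& entry : definition_count)
        if (entry.second > 1) report.duplicated.insert(entry.first);

    // A reference is unresolved when no object defines the symbol.
    for (const ObjectFile& obj : objects)
        for (const std::string& sym : obj.referenced)
            if (definition_count.find(sym) == definition_count.end())
                report.undefined.insert(sym);
    return report;
}
```

Leaving an object file off the linker's command line corresponds to dropping it from `objects`: every symbol only it defined then shows up as undefined.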

This topic is discussed at CProgramming.com: https://www.cprogramming.com/compilingandlinking.html

Here is what the author there wrote:

Compiling isn't quite the same as creating an executable file! Instead, creating an executable is a multistage process divided into two components: compilation and linking. In reality, even if a program "compiles fine" it might not actually work because of errors during the linking phase. The total process of going from source code files to an executable might better be referred to as a build.

Compilation

Compilation refers to the processing of source code files (.c, .cc, or .cpp) and the creation of an 'object' file. This step doesn't create anything the user can actually run. Instead, the compiler merely produces the machine language instructions that correspond to the source code file that was compiled. For instance, if you compile (but don't link) three separate files, you will have three object files created as output, each with the name .o or .obj (the extension will depend on your compiler). Each of these files contains a translation of your source code file into a machine language file -- but you can't run them yet! You need to turn them into executables your operating system can use. That's where the linker comes in.

Linking

Linking refers to the creation of a single executable file from multiple object files. In this step, it is common that the linker will complain about undefined functions (commonly, main itself). During compilation, if the compiler could not find the definition for a particular function, it would just assume that the function was defined in another file. If this isn't the case, there's no way the compiler would know -- it doesn't look at the contents of more than one file at a time. The linker, on the other hand, may look at multiple files and try to find references for the functions that weren't mentioned.

You might ask why there are separate compilation and linking steps. First, it's probably easier to implement things that way. The compiler does its thing, and the linker does its thing -- by keeping the functions separate, the complexity of the program is reduced. Another (more obvious) advantage is that this allows the creation of large programs without having to redo the compilation step every time a file is changed. Instead, using so called "conditional compilation", it is necessary to compile only those source files that have changed; for the rest, the object files are sufficient input for the linker. Finally, this makes it simple to implement libraries of pre-compiled code: just create object files and link them just like any other object file. (The fact that each file is compiled separately from information contained in other files, incidentally, is called the "separate compilation model".)

To get the full benefits of conditional compilation, it's probably easier to get a program to help you than to try and remember which files you've changed since you last compiled. (You could, of course, just recompile every file that has a timestamp greater than the timestamp of the corresponding object file.) If you're working with an integrated development environment (IDE) it may already take care of this for you. If you're using command line tools, there's a nifty utility called make that comes with most *nix distributions. Along with conditional compilation, it has several other nice features for programming, such as allowing different compilations of your program -- for instance, if you have a version producing verbose output for debugging.

Knowing the difference between the compilation phase and the link phase can make it easier to hunt for bugs. Compiler errors are usually syntactic in nature -- a missing semicolon, an extra parenthesis. Linking errors usually have to do with missing or multiple definitions. If you get an error that a function or variable is defined multiple times from the linker, that's a good indication that the error is that two of your source code files have the same function or variable.
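
The timestamp rule mentioned in that quote is easy to sketch. This is not how make is implemented; it is only the comparison it relies on, with timestamps modelled as plain integers and a negative value used, purely as a convention of this example, to mean that no object file exists yet. `SourceFile` and `sources_to_recompile` are invented names.

```cpp
#include <string>
#include <vector>

// Timestamps are plain integers here (for instance seconds since some
// epoch). obj_mtime < 0 is this sketch's marker for "no object file yet".
struct SourceFile {
    std::string name;
    long src_mtime;
    long obj_mtime;
};

// Return the sources whose object files are missing or older than the
// source they were built from; everything else can be reused as is.
std::vector<std::string> sources_to_recompile(
        const std::vector<SourceFile>& files) {
    std::vector<std::string> stale;
    for (const SourceFile& f : files) {
        if (f.obj_mtime < 0 || f.src_mtime > f.obj_mtime) {
            stale.push_back(f.name);
        }
    }
    return stale;
}
```

Only the files returned need to go through the compiler again; the linker then runs over all object files, fresh and reused alike.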

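GCC compiles a C/C++ program into an executable in 4 steps.

For example, gcc -o hello hello.c is carried out as follows:

1. Preprocessing

Preprocessing is done by the GNU C Preprocessor (cpp.exe), which includes the headers (#include) and expands the macros (#define).

cpp hello.c > hello.i

The resultant intermediate file "hello.i" contains the expanded source code.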

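2. Compilation

The compiler compiles the preprocessed source code into assembly code for a specific processor.

gcc -S hello.i

The -S option specifies to produce assembly code, instead of object code. The resultant assembly file is "hello.s".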

3. Assembly

The assembler (as.exe) converts the assembly code into machine code in the object file "hello.o".

as -o hello.o hello.s

4. Linker

Finally, the linker (ld.exe) links the object code with the library code to produce an executable file "hello".

ld -o hello hello.o ...libraries...
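
For a C++ source file the same four stages can be driven through the g++ front end instead of calling cpp, as and ld by hand. The sketch below assumes a file named hello.cpp; the file names and the `greeting` function are my own choices, while -E, -S, -c and -o are standard GCC options.

```cpp
// hello.cpp (hypothetical file name)
//
// The four stages, one g++ invocation each:
//   g++ -E hello.cpp -o hello.ii   (1. preprocess only)
//   g++ -S hello.ii                (2. compile to assembly, writes hello.s)
//   g++ -c hello.s -o hello.o      (3. assemble into an object file)
//   g++ hello.o -o hello           (4. link into the executable "hello")
//
// Stages 1 to 3 work on this file as it stands. Stage 4 also needs a
// definition of main, either added here or supplied by another object
// file; without one the linker reports main as an undefined symbol.
#include <string>

std::string greeting() { return "Hello, world!"; }
```

Letting g++ perform the link also saves spelling out the C++ runtime libraries that a direct ld invocation would need.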