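One of the stated reasons for knowing assembler is that, on occasion, it can be used to write code that performs better than the same code written in a higher-level language, C in particular. However, I've also heard it said many times that, although this is not entirely false, the cases where assembler can actually be used to generate faster code are extremely rare and require expert knowledge of and experience with assembly.
This question doesn't even get into the fact that assembler instructions are machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question asking for examples and data, not an extended discourse on assembler versus higher-level languages.
Can anyone provide specific examples of cases where assembly is faster than well-written C code compiled with a modern compiler, and can you support that claim with profiling evidence? I am pretty confident these cases exist, but I really want to know exactly how esoteric they are, since this seems to be a point of some contention.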
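Just chiming in with some history.
When I was much younger (the 1970s), assembly was important, in my experience, more for the size of the code than for its speed.
If a module in a higher-level language was, say, 1300 bytes of code, but an assembly version of the module was 300 bytes, that 1K bytes mattered a great deal when you were trying to fit the application into 16K or 32K of memory.
Compilers were not great at that time.
In old-time Fortran: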
X = (Y - Z)
IF (X .LT. 0) THEN
... do something
ENDIF
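The compiler at that time did a SUBTRACT instruction on X, then a TEST instruction.
In assembler, you would simply check the condition code (less than zero, zero, greater than zero) after the subtract.
For modern systems and compilers, none of that is a concern.
I do think it is still important to understand what the compiler is doing.
When you code in a higher-level language, you should understand what allows or prevents the compiler from performing loop unrolling.
The same goes for pipelining and look-ahead computation involving conditionals, when the compiler emits a "branch-likely".
Assembler is still needed when doing things that a higher-level language does not allow, like reading or writing processor-specific registers.
But for the most part, the ordinary programmer no longer needs it, beyond a basic understanding of how code gets compiled and executed.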
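Although C is "close" to the low-level manipulation of 8-bit, 16-bit, 32-bit, and 64-bit data, there are a few mathematical operations not supported by C that can often be performed elegantly in certain assembly instruction sets: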
1. Fixed-point multiplication: The product of two 16-bit numbers is a 32-bit number. But the rules in C say that the product of two 16-bit numbers is a 16-bit number, and the product of two 32-bit numbers is a 32-bit number -- the bottom half in both cases. (Strictly, operands narrower than int are promoted to int first, so the 16-bit case holds as stated only where int is 16 bits.) If you want the top half of a 16x16 multiply or a 32x32 multiply, you have to play games with the compiler. The general method is to cast to a larger-than-necessary bit width, multiply, shift down, and cast back:
    int16_t x, y;
    // int16_t is a typedef for "short"
    // set x and y to something
    int16_t prod = (int16_t)(((int32_t)x*y)>>16);
In this case the compiler may be smart enough to know that you're really just trying to get the top half of a 16x16 multiply and do the right thing with the machine's native 16x16 multiply. Or it may be stupid and require a library call to do a 32x32 multiply that's way overkill, because you only need 16 bits of the product -- but the C standard doesn't give you any way to express that.
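For anyone who wants to try the cast-multiply-shift idiom and look at what their compiler generates, here is a minimal sketch. The helper name `mul_hi16` is made up for illustration, and the code assumes that right-shifting a negative value is an arithmetic shift, which is implementation-defined in C but is what mainstream compilers do.

```c
#include <stdint.h>

/* Hypothetical helper: top 16 bits of the 32-bit product of two 16-bit values.
   Assumes >> on a negative int32_t is an arithmetic shift (implementation-defined). */
static int16_t mul_hi16(int16_t x, int16_t y)
{
    int32_t wide = (int32_t)x * y;   /* full 32-bit product */
    return (int16_t)(wide >> 16);    /* keep only the top half */
}
```

For example, `mul_hi16(0x4000, 0x4000)` gives 0x1000, since the full product is 0x10000000. Whether this compiles down to a single native multiply depends on the target and the compiler, so check the generated assembly rather than assuming.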
2. Certain bit-shifting operations (rotation/carries):
    // 256-bit array shifted right in its entirety:
    uint8_t x[32];
    for (int i = 32; --i > 0; )
    {
        x[i] = (x[i] >> 1) | (x[i-1] << 7);
    }
    x[0] >>= 1;
This is not too inelegant in C, but again, unless the compiler is smart enough to realize what you are doing, it's going to do a lot of "unnecessary" work. Many assembly instruction sets let you rotate or shift left/right through the carry flag, with the bit shifted out landing in the carry. So you could accomplish the above in 34 instructions: load a pointer to the beginning of the array, clear the carry, and perform 32 8-bit rotate-right-through-carry operations, using auto-increment on the pointer.
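In portable C the carry has to be tracked by hand. A rough sketch of that, walking from the most significant end the way the assembly loop would, but on 64-bit words instead of bytes (the function name and the word size are my own illustrative choices, not part of the code above):

```c
#include <stddef.h>
#include <stdint.h>

/* Shift an n-word number right by one bit; words[0] is the most significant word.
   The local `carry` stands in for the hardware carry flag.
   Returns the bit shifted out of the least significant word. */
static unsigned shift_right_1(uint64_t *words, size_t n)
{
    unsigned carry = 0;                      /* "clear the carry" */
    for (size_t i = 0; i < n; i++) {
        unsigned out = (unsigned)(words[i] & 1u);             /* bit about to fall off */
        words[i] = (words[i] >> 1) | ((uint64_t)carry << 63); /* pull in previous carry */
        carry = out;
    }
    return carry;
}
```

A 256-bit value is then four words and four loop iterations. The extra mask, shift, and OR per word are exactly the bookkeeping that a rotate-through-carry instruction does for free.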
For another example, there are linear feedback shift registers (LFSRs), which are elegantly performed in assembly: take a chunk of N bits (8, 16, 32, 64, 128, etc.), shift the whole thing right by 1 (see the algorithm above), and if the resulting carry is 1, XOR in a bit pattern that represents the polynomial.
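To make the LFSR step concrete, here is a small sketch in C of a 16-bit Galois LFSR. The tap mask 0xB400 (for x^16 + x^14 + x^13 + x^11 + 1) is a commonly cited maximal-length choice that I picked for illustration; the function names are hypothetical. Note how the C version has to save the low bit before shifting, because it cannot read the carry afterwards.

```c
#include <stdint.h>

/* One step of a 16-bit Galois LFSR with tap mask 0xB400 (illustrative choice). */
static uint16_t lfsr16_step(uint16_t state)
{
    unsigned lsb = state & 1u;   /* the bit that assembly would find in the carry flag */
    state >>= 1;
    if (lsb)
        state ^= 0xB400u;        /* XOR in the polynomial */
    return state;
}

/* Count steps until the register returns to its (nonzero) start state. */
static uint32_t lfsr16_period(uint16_t start)
{
    uint16_t s = start;
    uint32_t n = 0;
    do {
        s = lfsr16_step(s);
        n++;
    } while (s != start);
    return n;
}
```

With this mask, any nonzero start state cycles through all 65535 nonzero 16-bit values before repeating.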
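That said, I would not resort to these techniques unless I had serious performance constraints. As others have said, assembly is much harder to document/debug/test/maintain than C code: the performance gain comes with some serious costs.
Edit: 3. Overflow detection is possible in assembly (you can't really do it in C), which makes some algorithms much easier.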
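Without access to the flags, C code has to infer overflow from the operands and the result. A minimal sketch of the usual workarounds in standard C (both function names are made up for illustration): the unsigned case checks for wraparound after the add, while the signed case has to check before adding, because signed overflow is undefined behavior.

```c
#include <stdint.h>

/* Unsigned add; *carry_out is set to 1 if the 32-bit sum wrapped around. */
static uint32_t add_carry32(uint32_t a, uint32_t b, unsigned *carry_out)
{
    uint32_t sum = a + b;      /* wraps modulo 2^32 once stored in uint32_t */
    *carry_out = (sum < a);    /* a wrapped sum is smaller than either operand */
    return sum;
}

/* Returns nonzero if a + b would not fit in int32_t. Checked before adding. */
static int add_would_overflow(int32_t a, int32_t b)
{
    if (b > 0)
        return a > INT32_MAX - b;
    return a < INT32_MIN - b;
}
```

Each of these costs an extra comparison, where assembly would just branch on the carry or overflow flag set by the add itself. Some compilers offer builtins for this, but those are extensions rather than standard C.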
In my job, there are three reasons for me to know and use assembly. In order of importance:
1. Debugging - I often get library code that has bugs or incomplete documentation. I figure out what it's doing by stepping in at the assembly level. I have to do this about once a week. I also use it as a tool to debug problems in which my eyes don't spot the idiomatic error in C/C++/C#. Looking at the assembly gets past that.
2. Optimizing - the compiler does fairly well at optimizing, but I play in a different ballpark than most. I write image processing code that usually starts with code that looks like this:
    for (int y=0; y < imageHeight; y++) {
        for (int x=0; x < imageWidth; x++) {
            // do something
        }
    }
The "do something" part typically happens on the order of several million times (i.e., between 3 and 30 million). By scraping cycles in that "do something" phase, the performance gains are hugely magnified. I don't usually start there - I usually start by writing the code to work first, then do my best to refactor the C to be naturally better (better algorithm, less load in the loop, etc.). I usually need to read assembly to see what's going on and rarely need to write it. I do this maybe every two or three months.
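As an illustration of "less load in the loop", here is a sketch of the kind of C-level refactoring I mean, using a made-up operation (inverting an 8-bit grayscale image). The function name, the `stride` parameter, and the pixel layout are assumptions for the example, not code from a real project. The row address is computed once per row instead of once per pixel.

```c
#include <stddef.h>
#include <stdint.h>

/* Invert an 8-bit grayscale image in place. `stride` is the number of bytes
   between the starts of consecutive rows (assumed >= width). */
static void invert_image(uint8_t *pixels, int width, int height, size_t stride)
{
    for (int y = 0; y < height; y++) {
        uint8_t *row = pixels + (size_t)y * stride;  /* hoisted: once per row */
        for (int x = 0; x < width; x++) {
            row[x] = (uint8_t)(255 - row[x]);        /* the "do something" part */
        }
    }
}
```

An optimizing compiler may well do this hoisting on its own, which is exactly why reading the generated assembly is the useful step: it tells you whether the rewrite bought anything.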
3. Doing something the language won't let me. This includes getting the processor architecture and specific processor features, accessing flags in the CPU (man, I really wish C gave you access to the carry flag), etc. I do this maybe once every year or two.