了解汇编程序的原因之一是,有时可以使用汇编程序来编写比用高级语言(特别是C语言)编写的代码性能更好的代码。然而,我也听人说过很多次,尽管这并非完全错误,但实际上可以使用汇编程序来生成性能更好的代码的情况极其罕见,并且需要汇编方面的专业知识和经验。

这个问题甚至没有涉及到这样一个事实,即汇编程序指令将是特定于机器的、不可移植的,或者汇编程序的任何其他方面。当然,除了这一点之外,了解汇编还有很多很好的理由,但这是一个需要示例和数据的具体问题,而不是关于汇编程序与高级语言的扩展论述。

谁能提供一些具体的例子,说明使用现代编译器汇编代码比编写良好的C代码更快,并且您能否用分析证据支持这一说法?我相信这些案例确实存在,但我真的很想知道这些案例到底有多深奥,因为这似乎是一个有争议的问题。


当前回答

如今,考虑到像英特尔c++这样的编译器对C代码进行了极大的优化,它很难与编译器的输出竞争。

其他回答

Walter Bright的《optimization Immutable and Purity》可能值得一看,它不是一个概要测试,但向您展示了手写和编译器生成ASM之间的区别。Walter Bright写优化编译器,所以值得一看他的其他博客文章。

我想说的是,当你比编译器更擅长一组给定的指令时。所以我认为没有通用的答案

尽管C语言“接近”于对8位、16位、32位和64位数据的低级操作,但仍有一些C语言不支持的数学操作通常可以在某些汇编指令集中优雅地执行:

Fixed-point multiplication: The product of two 16-bit numbers is a 32-bit number. But the rules in C says that the product of two 16-bit numbers is a 16-bit number, and the product of two 32-bit numbers is a 32-bit number -- the bottom half in both cases. If you want the top half of a 16x16 multiply or a 32x32 multiply, you have to play games with the compiler. The general method is to cast to a larger-than-necessary bit width, multiply, shift down, and cast back: int16_t x, y; // int16_t is a typedef for "short" // set x and y to something int16_t prod = (int16_t)(((int32_t)x*y)>>16);` In this case the compiler may be smart enough to know that you're really just trying to get the top half of a 16x16 multiply and do the right thing with the machine's native 16x16multiply. Or it may be stupid and require a library call to do the 32x32 multiply that's way overkill because you only need 16 bits of the product -- but the C standard doesn't give you any way to express yourself. Certain bitshifting operations (rotation/carries): // 256-bit array shifted right in its entirety: uint8_t x[32]; for (int i = 32; --i > 0; ) { x[i] = (x[i] >> 1) | (x[i-1] << 7); } x[0] >>= 1; This is not too inelegant in C, but again, unless the compiler is smart enough to realize what you are doing, it's going to do a lot of "unnecessary" work. Many assembly instruction sets allow you to rotate or shift left/right with the result in the carry register, so you could accomplish the above in 34 instructions: load a pointer to the beginning of the array, clear the carry, and perform 32 8-bit right-shifts, using auto-increment on the pointer. For another example, there are linear feedback shift registers (LFSR) that are elegantly performed in assembly: Take a chunk of N bits (8, 16, 32, 64, 128, etc), shift the whole thing right by 1 (see above algorithm), then if the resulting carry is 1 then you XOR in a bit pattern that represents the polynomial.

尽管如此,除非有严重的性能限制,否则我不会求助于这些技术。正如其他人所说,汇编代码比C代码更难记录/调试/测试/维护:性能的提高伴随着一些严重的代价。

编辑:3。溢出检测在汇编中是可能的(在C中不能真正做到),这使得一些算法更容易。

http://cr.yp.to/qhasm.html有很多例子。

The question is a bit misleading. The answer is there in your post itself. It is always possible to write assembly solution for a particular problem which executes faster than any generated by a compiler. The thing is you need to be an expert in assembly to overcome the limitations of a compiler. An experienced assembly programmer can write programs in any HLL which performs faster than one written by an inexperienced. The truth is you can always write assembly programs executing faster than one generated by a compiler.