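One of the stated reasons for knowing assembler is that it can occasionally be used to write code that performs better than the same code written in a higher-level language, C in particular. However, I have also heard it said many times that, although this is not entirely false, the cases where assembler can actually produce faster code are extremely rare and require expert knowledge of and experience with assembly.

This question does not even get into the fact that assembler instructions are machine-specific and non-portable, or into any other aspect of assembler. There are, of course, plenty of good reasons for knowing assembly besides this one, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.

Can anyone provide specific examples of cases where assembly will be faster than well-written C code compiled with a modern compiler, and can you support that claim with profiling evidence? I am fairly confident these cases exist, but I would really like to know exactly how esoteric they are, since this seems to be a point of some contention.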


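Current answer

Walter Bright's "Optimizing Immutable and Purity" might be worth a look. It is not a profiled test, but it shows one good example of the difference between handwritten and compiler-generated ASM. Walter Bright writes optimizing compilers, so his other blog posts may be worth reading too.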

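Other answers

You cannot actually know whether your well-written C code is really fast unless you have looked at the disassembly of what the compiler produces. Many times you look at it and find that "well-written" was subjective.

So it is not necessary to write in assembler to get the fastest code, but it is certainly worth knowing assembler for that very reason.

Here are a few examples from my personal experience: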

Access to instructions that are not accessible from C. For instance, many architectures (like x86-64, IA-64, DEC Alpha, and 64-bit MIPS or PowerPC) support a 64 bit by 64 bit multiplication producing a 128 bit result. GCC recently added an extension providing access to such instructions, but before that assembly was required. And access to this instruction can make a huge difference on 64-bit CPUs when implementing something like RSA - sometimes as much as a factor of 4 improvement in performance.

Access to CPU-specific flags. The one that has bitten me a lot is the carry flag; when doing a multiple-precision addition, if you don't have access to the CPU carry bit one must instead compare the result to see if it overflowed, which takes 3-5 more instructions per limb; and worse, which are quite serial in terms of data accesses, which kills performance on modern superscalar processors. When processing thousands of such integers in a row, being able to use addc is a huge win (there are superscalar issues with contention on the carry bit as well, but modern CPUs deal pretty well with it).

SIMD. Even autovectorizing compilers can only do relatively simple cases, so if you want good SIMD performance it's unfortunately often necessary to write the code directly. Of course you can use intrinsics instead of assembly but once you're at the intrinsics level you're basically writing assembly anyway, just using the compiler as a register allocator and (nominally) instruction scheduler. (I tend to use intrinsics for SIMD simply because the compiler can generate the function prologues and whatnot for me so I can use the same code on Linux, OS X, and Windows without having to deal with ABI issues like function calling conventions, but other than that the SSE intrinsics really aren't very nice - the Altivec ones seem better though I don't have much experience with them).

As examples of things a (current day) vectorizing compiler can't figure out, read about bitslicing AES or SIMD error correction - one could imagine a compiler that could analyze algorithms and generate such code, but it feels to me like such a smart compiler is at least 30 years away from existing (at best).
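
To make the first two points concrete, here is a minimal C sketch, not the answerer's own code. It assumes a compiler that provides the `unsigned __int128` extension (GCC and Clang do on 64-bit targets); the function names `mul_64x64` and `mp_add` are hypothetical. The first function gets the full 128-bit product without assembly. The second shows the compare-based carry detection that portable C has to fall back on when the carry flag is not directly reachable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Full 64x64 -> 128-bit product, returned as high and low 64-bit halves.
   Assumption: the compiler supports the unsigned __int128 extension. */
void mul_64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}

/* r = a + b over n limbs (least significant limb first).
   Returns the carry out of the top limb (0 or 1).
   The carry is recovered by comparing each sum with an operand,
   which is the extra per-limb work described in the text. */
uint64_t mp_add(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n)
{
    assert(r != NULL && a != NULL && b != NULL);
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        uint64_t c1 = s < carry;   /* wrapped while adding the incoming carry */
        s += b[i];
        uint64_t c2 = s < b[i];    /* wrapped while adding b[i] */
        r[i] = s;
        carry = c1 | c2;           /* at most one of the two can be set */
    }
    return carry;
}
```

Each iteration depends on the carry from the previous one, so the comparisons sit on the loop's critical path; that is the serial dependency the answer complains about. Whether a given compiler turns this pattern into an add-with-carry instruction is best checked in the disassembly.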

On the other hand, multicore machines and distributed systems have shifted many of the biggest performance wins in the other direction - get an extra 20% speedup writing your inner loops in assembly, or 300% by running them across multiple cores, or 10000% by running them across a cluster of machines. And of course high level optimizations (things like futures, memoization, etc) are often much easier to do in a higher level language like ML or Scala than C or asm, and often can provide a much bigger performance win. So, as always, there are tradeoffs to be made.

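Many years ago I was teaching someone to program in C. The exercise was to rotate a graphic through 90 degrees. He came back with a solution that took several minutes to complete, mainly because he was using multiplies, divides and the like.

I showed him how to recast the problem using bit shifts, and on the non-optimizing compiler he had, the processing time came down to about 30 seconds.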
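
The post does not show the code, so the following is only a sketch of what a shift-based version could look like. It assumes, purely for illustration, a square image with one byte per pixel, stored row by row, whose side length is a power of two; the name `rotate90_cw` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate a square image 90 degrees clockwise into a separate buffer.
   Assumptions (not from the original post): one byte per pixel,
   row-major storage, side length n = 1 << log2_n. */
void rotate90_cw(uint8_t *dst, const uint8_t *src, unsigned log2_n)
{
    assert(dst != NULL && src != NULL && dst != src);
    assert(log2_n < 16);
    size_t n = (size_t)1 << log2_n;
    for (size_t y = 0; y < n; y++) {
        for (size_t x = 0; x < n; x++) {
            /* (y << log2_n) stands in for y * n.
               The pixel at column x, row y moves to column n-1-y, row x. */
            dst[(x << log2_n) + (n - 1 - y)] = src[(y << log2_n) + x];
        }
    }
}
```

An optimizing compiler will typically perform this kind of strength reduction on its own when the multiplier is a known power of two, which fits the rest of the anecdote.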

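I had just got an optimizing compiler, and the same code rotated the graphic in under 5 seconds. I looked at the assembly code the compiler was generating, and from what I saw I decided there and then that my days of writing assembler were over.

Nowadays, given how aggressively compilers such as Intel C++ optimize C code, it is very hard to compete with the compiler's output.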
