了解汇编程序的原因之一是,有时可以使用汇编程序来编写比用高级语言(特别是C语言)编写的代码性能更好的代码。然而,我也听人说过很多次,尽管这并非完全错误,但实际上可以使用汇编程序来生成性能更好的代码的情况极其罕见,并且需要汇编方面的专业知识和经验。

这个问题甚至没有涉及到这样一个事实,即汇编程序指令将是特定于机器的、不可移植的,或者汇编程序的任何其他方面。当然,除了这一点之外,了解汇编还有很多很好的理由,但这是一个需要示例和数据的具体问题,而不是关于汇编程序与高级语言的扩展论述。

谁能提供一些具体的例子,说明使用现代编译器汇编代码比编写良好的C代码更快,并且您能否用分析证据支持这一说法?我相信这些案例确实存在,但我真的很想知道这些案例到底有多深奥,因为这似乎是一个有争议的问题。


当前回答

只有在使用编译器不支持的特殊用途指令集时。

为了最大限度地利用具有多个管道和预测分支的现代CPU的计算能力,您需要以这样一种方式来构造汇编程序:a)人类几乎不可能编写b)甚至更不可能维护。

此外,更好的算法、数据结构和内存管理将为您提供至少一个数量级的性能,而不是在汇编中进行的微观优化。

其他回答

如今,考虑到像英特尔c++这样的编译器对C代码进行了极大的优化,它很难与编译器的输出竞争。

我已经阅读了所有的答案(超过30个),并没有找到一个简单的原因:如果你读过并练习过Intel®64和IA-32架构优化参考手册,汇编程序比C更快,所以汇编程序可能更慢的原因是编写这种慢汇编程序的人没有阅读优化手册。

In the good old days of Intel 80286, each instruction was executed at a fixed count of CPU cycles. Still, since Pentium Pro, released in 1995, Intel processors became superscalar, utilizing Complex Pipelining: Out-of-Order Execution & Register Renaming. Before that, on Pentium, produced in 1993, there were U and V pipelines. Therefore, Pentium introduced dual pipelines that could execute two simple instructions at one clock cycle if they didn't depend on one another. However, this was nothing compared with the Out-of-Order Execution & Register Renaming that appeared in Pentium Pro. This approach introduced in Pentium Pro is practically the same nowadays on most recent Intel processors.

Let me explain the Out-of-Order Execution in a few words. The fastest code is where instructions do not depend on previous results, e.g., you should always clear whole registers (by movzx) to remove dependency from previous values of the registers you are working with, so they may be renamed internally by the CPU to allow instruction execute in parallel or in a different order. Or, on some processors, false dependency may exist that may also slow things down, like false dependency on Pentium 4 for inc/dec, so you may wish to use add eax, 1 instead or inc eax to remove dependency on the previous state of the flags.

如果时间允许,您可以阅读更多无序执行和注册重命名。因特网上有大量的信息。

There are also many other essential issues like branch prediction, number of load and store units, number of gates that execute micro-ops, memory cache coherence protocols, etc., but the crucial thing to consider is the Out-of-Order Execution. Most people are simply not aware of the Out-of-Order Execution. Therefore, they write their assembly programs like for 80286, expecting their instructions will take a fixed time to execute regardless of the context. At the same time, C compilers are aware of the Out-of-Order Execution and generate the code correctly. That's why the code of such uninformed people is slower, but if you become knowledgeable, your code will be faster.

除了乱序执行之外,还有很多优化技巧和技巧。请阅读上面提到的优化手册:-)

However, assembly language has its own drawbacks when it comes to optimization. According to Peter Cordes (see the comment below), some of the optimizations compilers do would be unmaintainable for large code-bases in hand-written assembly. For example, suppose you write in assembly. In that case, you need to completely change an inline function (an assembly macro) when it inlines into a function that calls it with some arguments being constants. At the same time, a C compiler makes its job a lot simpler—and inlining the same code in different ways into different call sites. There is a limit to what you can do with assembly macros. So to get the same benefit, you'd have to manually optimize the same logic in each place to match the constants and available registers you have.

以下是我个人经历中的几个例子:

Access to instructions that are not accessible from C. For instance, many architectures (like x86-64, IA-64, DEC Alpha, and 64-bit MIPS or PowerPC) support a 64 bit by 64 bit multiplication producing a 128 bit result. GCC recently added an extension providing access to such instructions, but before that assembly was required. And access to this instruction can make a huge difference on 64-bit CPUs when implementing something like RSA - sometimes as much as a factor of 4 improvement in performance. Access to CPU-specific flags. The one that has bitten me a lot is the carry flag; when doing a multiple-precision addition, if you don't have access to the CPU carry bit one must instead compare the result to see if it overflowed, which takes 3-5 more instructions per limb; and worse, which are quite serial in terms of data accesses, which kills performance on modern superscalar processors. When processing thousands of such integers in a row, being able to use addc is a huge win (there are superscalar issues with contention on the carry bit as well, but modern CPUs deal pretty well with it). SIMD. Even autovectorizing compilers can only do relatively simple cases, so if you want good SIMD performance it's unfortunately often necessary to write the code directly. Of course you can use intrinsics instead of assembly but once you're at the intrinsics level you're basically writing assembly anyway, just using the compiler as a register allocator and (nominally) instruction scheduler. (I tend to use intrinsics for SIMD simply because the compiler can generate the function prologues and whatnot for me so I can use the same code on Linux, OS X, and Windows without having to deal with ABI issues like function calling conventions, but other than that the SSE intrinsics really aren't very nice - the Altivec ones seem better though I don't have much experience with them). As examples of things a (current day) vectorizing compiler can't figure out, read about bitslicing AES or SIMD error correction - one could imagine a compiler that could analyze algorithms and generate such code, but it feels to me like such a smart compiler is at least 30 years away from existing (at best).

On the other hand, multicore machines and distributed systems have shifted many of the biggest performance wins in the other direction - get an extra 20% speedup writing your inner loops in assembly, or 300% by running them across multiple cores, or 10000% by running them across a cluster of machines. And of course high level optimizations (things like futures, memoization, etc) are often much easier to do in a higher level language like ML or Scala than C or asm, and often can provide a much bigger performance win. So, as always, there are tradeoffs to be made.

GCC已经成为广泛使用的编译器。它的优化通常不是很好。比编写汇编程序的普通程序员好得多,但就实际性能而言,并没有那么好。有些编译器产生的代码简直令人难以置信。所以一般来说,有很多地方你可以进入编译器的输出,调整汇编器的性能,和/或简单地从头重写例程。

使用SIMD指令的矩阵操作可能比编译器生成的代码更快。