One of the reasons given for knowing assembler is that, on occasion, it can be used to write code that performs better than code written in a higher-level language, C in particular. However, I've also heard it said many times that, although this isn't entirely false, the cases where assembler can actually be used to generate better-performing code are extremely rare and require expert knowledge of and experience with assembly.

This question doesn't even get into the fact that assembler instructions will be machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.

Can anyone provide specific examples of cases where assembly will be faster than well-written C code using a modern compiler, and can you support that claim with profiling evidence? I'm fairly confident these cases exist, but I'd really like to know exactly how esoteric they are, since this seems to be a point of some contention.


Current answer

Nowadays, considering how heavily a compiler like Intel C++ optimizes C code, it is very hard to compete with the compiler's output.

Other answers

Matrix operations using SIMD instructions are probably faster than compiler-generated code.
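As a rough, hedged illustration of what that kind of hand-vectorized code can look like (written here with SSE intrinsics rather than raw assembly; the function name, the column-major 4x4 layout, and the unaligned loads are assumptions made only for this sketch):

/* Minimal sketch: 4x4 column-major matrix times 4-vector with SSE intrinsics.
   Hypothetical example, not from the original answer. Compile with SSE enabled. */
#include <xmmintrin.h>

void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
    __m128 col0 = _mm_loadu_ps(&m[0]);   /* first column  */
    __m128 col1 = _mm_loadu_ps(&m[4]);   /* second column */
    __m128 col2 = _mm_loadu_ps(&m[8]);   /* third column  */
    __m128 col3 = _mm_loadu_ps(&m[12]);  /* fourth column */

    /* Broadcast each vector element and accumulate the scaled columns. */
    __m128 r = _mm_mul_ps(col0, _mm_set1_ps(v[0]));
    r = _mm_add_ps(r, _mm_mul_ps(col1, _mm_set1_ps(v[1])));
    r = _mm_add_ps(r, _mm_mul_ps(col2, _mm_set1_ps(v[2])));
    r = _mm_add_ps(r, _mm_mul_ps(col3, _mm_set1_ps(v[3])));

    _mm_storeu_ps(out, r);
}

Whether this actually beats a modern auto-vectorizer depends on the compiler, the flags, and the surrounding code, so measure before committing to it.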

I can't give specific examples because it was too many years ago, but there were plenty of cases where hand-written assembler could beat any compiler. Reasons:

- You can deviate from calling conventions and pass arguments in registers.
- You can think carefully about how registers are used and avoid spilling variables to memory.
- For things like jump tables, you can avoid checking the index against its bounds (see the sketch after this list).
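As a hedged sketch of the jump-table point above (a hypothetical C dispatcher; __builtin_unreachable() is a GCC/Clang extension that promises the default case never happens, which lets the compiler drop the range check a plain switch would otherwise need, much as hand-written assembly could):

/* Hypothetical dispatcher. With the default branch marked unreachable,
   GCC/Clang can emit a plain indirect jump through a table, without the
   bounds/default check a normal switch needs. Only valid if op really is 0..3. */
int dispatch(int op, int a, int b)
{
    switch (op) {
        case 0: return a + b;
        case 1: return a - b;
        case 2: return a * b;
        case 3: return a / b;
        default: __builtin_unreachable();  /* promise: op is always in range */
    }
}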

Basically, compilers do a pretty good job of optimizing, and that is almost always "good enough", but in some situations (like graphics rendering) where you are paying dearly for every single cycle, you can take shortcuts because you know the code, while the compiler can't because it has to stay on the safe side.

In fact, I have heard of graphics rendering code in which a routine, such as a line-drawing or polygon-fill routine, actually generated a small block of machine code on the stack and executed it there, to avoid continual decision-making about line style, width, pattern, and so on.
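The general idea can be sketched in C. This is only a minimal, hypothetical illustration for Linux/x86-64: modern systems usually refuse to execute code on the stack, so the sketch emits the bytes into an executable mmap'd buffer instead, and the generated code is just mov eax, 42; ret rather than a specialized inner loop:

/* Sketch only: emit "mov eax, 42; ret" into executable memory and call it.
   Assumes Linux/x86-64; a real renderer would emit its specialized inner loop,
   and hardened systems may refuse writable+executable mappings. */
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00,  /* mov eax, 42 */
                             0xC3 };                        /* ret         */

    void *buf = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    memcpy(buf, code, sizeof code);
    int (*fn)(void) = (int (*)(void))buf;

    int result = fn();            /* returns 42 */
    munmap(buf, sizeof code);
    return result == 42 ? 0 : 1;
}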

That said, I want compilers to generate good assembly code for me without being too clever, and they mostly do. In fact, one of the things I dislike about Fortran is its habit of scrambling the code in order to "optimize" it, usually to no significant purpose.

Usually, when applications have a performance problem, it is due to wasteful design. These days I would never recommend assembler for performance unless the application as a whole had already been tuned exhaustively, was still not fast enough, and was spending all its time in tight inner loops.

Added: I've seen plenty of applications written in assembly language, and the main speed advantage over a language like C, Pascal, Fortran, etc. was that the programmer was far more careful when coding in assembler. He or she is going to write roughly 100 lines of code a day regardless of language, and in a compiled language that equals 300 or 400 instructions.

One of the more famous snippets of assembly comes from Michael Abrash's texture mapping loop (explained in detail here):

add edx,[DeltaVFrac] ; add in dVFrac
sbb ebp,ebp ; store carry
mov [edi],al ; write pixel n
mov al,[esi] ; fetch pixel n+1
add ecx,ebx ; add in dUFrac
adc esi,[4*ebp + UVStepVCarry]; add in steps

Nowadays most compilers expose advanced CPU-specific instructions as intrinsics, i.e. functions that compile down to the actual instruction. MS Visual C++ supports intrinsics for MMX, SSE, SSE2, SSE3, and SSE4, so you don't have to worry much about dropping to assembly to take advantage of platform-specific instructions. Visual C++ can also take advantage of the actual architecture you are targeting via the appropriate /ARCH setting.

You really don't know whether your well-written C code is actually fast unless you have looked at the disassembly of what the compiler produces. Many times you look at it and see that "well written" was subjective.

So it isn't necessary to write in assembler to get the fastest code, but it is certainly worth knowing assembler for the very same reason.

I have read all the answers (more than 30) and didn't find a simple reason: assembler is faster than C if you have read and practiced the Intel® 64 and IA-32 Architectures Optimization Reference Manual, so the reason assembly may be slower is that people who write such slow assembly haven't read the Optimization Manual.

In the good old days of Intel 80286, each instruction was executed at a fixed count of CPU cycles. Still, since Pentium Pro, released in 1995, Intel processors became superscalar, utilizing Complex Pipelining: Out-of-Order Execution & Register Renaming. Before that, on Pentium, produced in 1993, there were U and V pipelines. Therefore, Pentium introduced dual pipelines that could execute two simple instructions at one clock cycle if they didn't depend on one another. However, this was nothing compared with the Out-of-Order Execution & Register Renaming that appeared in Pentium Pro. This approach introduced in Pentium Pro is practically the same nowadays on most recent Intel processors.

Let me explain the Out-of-Order Execution in a few words. The fastest code is where instructions do not depend on previous results, e.g., you should always clear whole registers (by movzx) to remove the dependency on previous values of the registers you are working with, so that they may be renamed internally by the CPU to allow instructions to execute in parallel or in a different order. On some processors, false dependencies may exist that also slow things down, like the false dependency for inc/dec on Pentium 4, so you may wish to use add eax, 1 instead of inc eax to remove the dependency on the previous state of the flags.
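To make the dependency-chain point concrete in C terms, here is a hypothetical reduction in two variants. Both compute the same sum, but the second splits the work into two independent chains that an out-of-order core can overlap; the actual speedup depends on the CPU and on whether the compiler vectorizes the loop anyway:

#include <stddef.h>

/* One long dependency chain: each add must wait for the previous one. */
double sum_serial(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two independent chains: the out-of-order core can run them in parallel.
   (With strict FP semantics the compiler will not do this reassociation itself.) */
double sum_split(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];
    return s0 + s1;
}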

If time permits, you can read more about Out-of-Order Execution and Register Renaming. There is plenty of information available on the Internet.

There are also many other essential issues like branch prediction, number of load and store units, number of gates that execute micro-ops, memory cache coherence protocols, etc., but the crucial thing to consider is the Out-of-Order Execution. Most people are simply not aware of the Out-of-Order Execution. Therefore, they write their assembly programs like for 80286, expecting their instructions will take a fixed time to execute regardless of the context. At the same time, C compilers are aware of the Out-of-Order Execution and generate the code correctly. That's why the code of such uninformed people is slower, but if you become knowledgeable, your code will be faster.

Besides Out-of-Order Execution, there are many more optimization tricks and techniques. Please read the Optimization Manual mentioned above :-)

However, assembly language has its own drawbacks when it comes to optimization. According to Peter Cordes (see the comment below), some of the optimizations compilers do would be unmaintainable for large code-bases in hand-written assembly. For example, if you write in assembly, you need to completely change an inline function (an assembly macro) when it is inlined into a function that calls it with some arguments being constants. A C compiler, by contrast, does this job far more easily, inlining the same code in different ways into different call sites. There is a limit to what you can do with assembly macros. So to get the same benefit, you'd have to manually optimize the same logic in each place to match the constants and available registers you have.
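A small, hypothetical C illustration of the point Peter Cordes is making: the compiler can specialize the same inline helper differently at every call site where arguments are compile-time constants, while an assembly macro would have to be re-tuned by hand for each such site:

#include <stdint.h>

/* Generic helper: scale by num/den and clamp to 8 bits. */
static inline uint8_t scale_clamp(int x, int num, int den)
{
    int v = x * num / den;
    if (v < 0)   v = 0;
    if (v > 255) v = 255;
    return (uint8_t)v;
}

uint8_t half(int x)           { return scale_clamp(x, 1, 2); }    /* likely reduces to a shift      */
uint8_t identity(int x)       { return scale_clamp(x, 1, 1); }    /* the mul/div likely fold away   */
uint8_t by_gain(int x, int g) { return scale_clamp(x, g, 100); }  /* general-purpose code is kept   */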