了解汇编程序的原因之一是,有时可以使用汇编程序来编写比用高级语言(特别是C语言)编写的代码性能更好的代码。然而,我也听人说过很多次,尽管这并非完全错误,但实际上可以使用汇编程序来生成性能更好的代码的情况极其罕见,并且需要汇编方面的专业知识和经验。

这个问题甚至没有涉及到这样一个事实,即汇编程序指令将是特定于机器的、不可移植的,或者汇编程序的任何其他方面。当然,除了这一点之外,了解汇编还有很多很好的理由,但这是一个需要示例和数据的具体问题,而不是关于汇编程序与高级语言的扩展论述。

谁能提供一些具体的例子,说明使用现代编译器汇编代码比编写良好的C代码更快,并且您能否用分析证据支持这一说法?我相信这些案例确实存在,但我真的很想知道这些案例到底有多深奥,因为这似乎是一个有争议的问题。


当前回答

在我的工作中,有三个原因让我了解和使用组装。按重要性排序:

Debugging - I often get library code that has bugs or incomplete documentation. I figure out what it's doing by stepping in at the assembly level. I have to do this about once a week. I also use it as a tool to debug problems in which my eyes don't spot the idiomatic error in C/C++/C#. Looking at the assembly gets past that. Optimizing - the compiler does fairly well in optimizing, but I play in a different ballpark than most. I write image processing code that usually starts with code that looks like this: for (int y=0; y < imageHeight; y++) { for (int x=0; x < imageWidth; x++) { // do something } } the "do something part" typically happens on the order of several million times (ie, between 3 and 30). By scraping cycles in that "do something" phase, the performance gains are hugely magnified. I don't usually start there - I usually start by writing the code to work first, then do my best to refactor the C to be naturally better (better algorithm, less load in the loop etc). I usually need to read assembly to see what's going on and rarely need to write it. I do this maybe every two or three months. doing something the language won't let me. These include - getting the processor architecture and specific processor features, accessing flags not in the CPU (man, I really wish C gave you access to the carry flag), etc. I do this maybe once a year or two years.

其他回答

很多年前,我教别人用c语言编程。练习是将图形旋转90度。他得到了一个花了几分钟才能完成的解,主要是因为他使用了乘法和除法等。

我向他展示了如何使用位移位重定义问题,在他拥有的非优化编译器上,处理时间缩短到大约30秒。

我刚刚得到了一个优化编译器,相同的代码在< 5秒内旋转图形。我看着编译器生成的汇编代码,从我所看到的,我决定我写汇编程序的日子结束了。

长波克,只有一个限制时间。当你没有足够的资源来优化每一个代码的变化,并花时间分配寄存器,优化一些溢出和诸如此类的事情时,编译器每次都会赢。对代码进行修改、重新编译和度量。如有必要重复。

此外,你可以在高水平方面做很多事情。此外,检查生成的程序集可能会给人一种代码是垃圾的印象,但实际上它的运行速度比您想象的要快。例子:

Int y = data[i]; //在这里做一些事情。 call_function (y,…);

编译器将读取数据,将其推入堆栈(溢出),然后从堆栈读取并作为参数传递。听起来屎?它实际上可能是非常有效的延迟补偿,并导致更快的运行时。

//优化版本 call_function(数据[我],…);//毕竟不是那么优化。

优化版本的想法是,我们降低了寄存器压力,避免溢出。但事实上,“垃圾”版本更快!

看看汇编代码,只看指令,然后得出结论:指令越多,速度越慢,这将是一个错误的判断。

这里需要注意的是:许多组装专家认为他们知道很多,但知道的很少。规则也会随着架构的变化而变化。例如,x86代码并不存在总是最快的银弹。如今,最好还是按照经验法则行事:

记忆很慢 缓存速度快 尽量更好地使用缓存 你多久会错过一次?你有延迟补偿策略吗? 对于一个cache miss,你可以执行10-100个ALU/FPU/SSE指令 应用程序架构很重要。 . .但是当问题不在架构上时,它就没有帮助了

此外,过于相信编译器会神奇地将考虑不周到的C/ c++代码转换为“理论上最优”的代码是一厢情愿的想法。如果你关心这个低级别的“性能”,你必须知道你使用的编译器和工具链。

C/ c++中的编译器通常不太擅长重新排序子表达式,因为对于初学者来说,函数有副作用。函数式语言没有受到这个警告的影响,但它不太适合当前的生态系统。有一些编译器选项可以允许宽松的精确规则,允许编译器/链接器/代码生成器改变操作的顺序。

这个话题有点死路一条;对于大多数人来说,这是无关紧要的,而剩下的人,他们已经知道自己在做什么了。

这一切都归结为:“理解你在做什么”,这与知道你在做什么有点不同。

一个更著名的组装片段来自Michael Abrash的纹理映射循环(在这里详细解释):

add edx,[DeltaVFrac] ; add in dVFrac
sbb ebp,ebp ; store carry
mov [edi],al ; write pixel n
mov al,[esi] ; fetch pixel n+1
add ecx,ebx ; add in dUFrac
adc esi,[4*ebp + UVStepVCarry]; add in steps

现在,大多数编译器将高级CPU特定指令表示为intrinsic,即编译为实际指令的函数。MS Visual c++支持MMX、SSE、SSE2、SSE3和SSE4的intrinsic,因此您不必太过担心使用特定于平台的指令来进行汇编。Visual c++还可以通过适当的/ARCH设置来利用您所针对的实际体系结构。

尽管C语言“接近”于对8位、16位、32位和64位数据的低级操作,但仍有一些C语言不支持的数学操作通常可以在某些汇编指令集中优雅地执行:

Fixed-point multiplication: The product of two 16-bit numbers is a 32-bit number. But the rules in C says that the product of two 16-bit numbers is a 16-bit number, and the product of two 32-bit numbers is a 32-bit number -- the bottom half in both cases. If you want the top half of a 16x16 multiply or a 32x32 multiply, you have to play games with the compiler. The general method is to cast to a larger-than-necessary bit width, multiply, shift down, and cast back: int16_t x, y; // int16_t is a typedef for "short" // set x and y to something int16_t prod = (int16_t)(((int32_t)x*y)>>16);` In this case the compiler may be smart enough to know that you're really just trying to get the top half of a 16x16 multiply and do the right thing with the machine's native 16x16multiply. Or it may be stupid and require a library call to do the 32x32 multiply that's way overkill because you only need 16 bits of the product -- but the C standard doesn't give you any way to express yourself. Certain bitshifting operations (rotation/carries): // 256-bit array shifted right in its entirety: uint8_t x[32]; for (int i = 32; --i > 0; ) { x[i] = (x[i] >> 1) | (x[i-1] << 7); } x[0] >>= 1; This is not too inelegant in C, but again, unless the compiler is smart enough to realize what you are doing, it's going to do a lot of "unnecessary" work. Many assembly instruction sets allow you to rotate or shift left/right with the result in the carry register, so you could accomplish the above in 34 instructions: load a pointer to the beginning of the array, clear the carry, and perform 32 8-bit right-shifts, using auto-increment on the pointer. For another example, there are linear feedback shift registers (LFSR) that are elegantly performed in assembly: Take a chunk of N bits (8, 16, 32, 64, 128, etc), shift the whole thing right by 1 (see above algorithm), then if the resulting carry is 1 then you XOR in a bit pattern that represents the polynomial.

尽管如此,除非有严重的性能限制,否则我不会求助于这些技术。正如其他人所说,汇编代码比C代码更难记录/调试/测试/维护:性能的提高伴随着一些严重的代价。

编辑:3。溢出检测在汇编中是可能的(在C中不能真正做到),这使得一些算法更容易。

我已经阅读了所有的答案(超过30个),并没有找到一个简单的原因:如果你读过并练习过Intel®64和IA-32架构优化参考手册,汇编程序比C更快,所以汇编程序可能更慢的原因是编写这种慢汇编程序的人没有阅读优化手册。

In the good old days of Intel 80286, each instruction was executed at a fixed count of CPU cycles. Still, since Pentium Pro, released in 1995, Intel processors became superscalar, utilizing Complex Pipelining: Out-of-Order Execution & Register Renaming. Before that, on Pentium, produced in 1993, there were U and V pipelines. Therefore, Pentium introduced dual pipelines that could execute two simple instructions at one clock cycle if they didn't depend on one another. However, this was nothing compared with the Out-of-Order Execution & Register Renaming that appeared in Pentium Pro. This approach introduced in Pentium Pro is practically the same nowadays on most recent Intel processors.

Let me explain the Out-of-Order Execution in a few words. The fastest code is where instructions do not depend on previous results, e.g., you should always clear whole registers (by movzx) to remove dependency from previous values of the registers you are working with, so they may be renamed internally by the CPU to allow instruction execute in parallel or in a different order. Or, on some processors, false dependency may exist that may also slow things down, like false dependency on Pentium 4 for inc/dec, so you may wish to use add eax, 1 instead or inc eax to remove dependency on the previous state of the flags.

如果时间允许,您可以阅读更多无序执行和注册重命名。因特网上有大量的信息。

There are also many other essential issues like branch prediction, number of load and store units, number of gates that execute micro-ops, memory cache coherence protocols, etc., but the crucial thing to consider is the Out-of-Order Execution. Most people are simply not aware of the Out-of-Order Execution. Therefore, they write their assembly programs like for 80286, expecting their instructions will take a fixed time to execute regardless of the context. At the same time, C compilers are aware of the Out-of-Order Execution and generate the code correctly. That's why the code of such uninformed people is slower, but if you become knowledgeable, your code will be faster.

除了乱序执行之外,还有很多优化技巧和技巧。请阅读上面提到的优化手册:-)

However, assembly language has its own drawbacks when it comes to optimization. According to Peter Cordes (see the comment below), some of the optimizations compilers do would be unmaintainable for large code-bases in hand-written assembly. For example, suppose you write in assembly. In that case, you need to completely change an inline function (an assembly macro) when it inlines into a function that calls it with some arguments being constants. At the same time, a C compiler makes its job a lot simpler—and inlining the same code in different ways into different call sites. There is a limit to what you can do with assembly macros. So to get the same benefit, you'd have to manually optimize the same logic in each place to match the constants and available registers you have.