了解汇编程序的原因之一是,有时可以使用汇编程序来编写比用高级语言(特别是C语言)编写的代码性能更好的代码。然而,我也听人说过很多次,尽管这并非完全错误,但实际上可以使用汇编程序来生成性能更好的代码的情况极其罕见,并且需要汇编方面的专业知识和经验。

这个问题甚至没有涉及到这样一个事实,即汇编程序指令将是特定于机器的、不可移植的,或者汇编程序的任何其他方面。当然,除了这一点之外,了解汇编还有很多很好的理由,但这是一个需要示例和数据的具体问题,而不是关于汇编程序与高级语言的扩展论述。

谁能提供一些具体的例子,说明使用现代编译器汇编代码比编写良好的C代码更快,并且您能否用分析证据支持这一说法?我相信这些案例确实存在,但我真的很想知道这些案例到底有多深奥,因为这似乎是一个有争议的问题。


当前回答

我想说的是,当你比编译器更擅长一组给定的指令时。所以我认为没有通用的答案

其他回答

我很惊讶居然没人这么说。如果用汇编编写strlen()函数,速度会快得多!在C中,你能做的最好的事情就是

int c;
for(c = 0; str[c] != '\0'; c++) {}

在组装过程中,你可以大大加快速度:

mov esi, offset string
mov edi, esi
xor ecx, ecx

lp:
mov ax, byte ptr [esi]
cmp al, cl
je  end_1
cmp ah, cl
je end_2
mov bx, byte ptr [esi + 2]
cmp bl, cl
je end_3
cmp bh, cl
je end_4
add esi, 4
jmp lp

end_4:
inc esi

end_3:
inc esi

end_2:
inc esi

end_1:
inc esi

mov ecx, esi
sub ecx, edi

长度单位是ecx。这一次比较4个字符,所以速度快4倍。并且考虑使用eax和ebx的高阶词,它将比之前的C例程快8倍!

在Amiga上,CPU和图形/音频芯片会为了访问特定区域的RAM(具体来说是前2MB的RAM)而争斗。因此,当你只有2MB RAM(或更少)时,显示复杂的图形加上播放声音会杀死CPU的性能。

在汇编程序中,你可以巧妙地交错你的代码,使CPU只在图形/音频芯片内部繁忙时(即当总线空闲时)才尝试访问RAM。因此,通过重新排序指令,巧妙地使用CPU缓存,总线定时,你可以实现一些使用任何高级语言都不可能实现的效果,因为你必须为每个命令定时,甚至在这里或那里插入nop,以使不同的芯片不受彼此的雷达影响。

这也是为什么CPU的NOP (No Operation -什么都不做)指令实际上可以让你的整个应用程序运行得更快的另一个原因。

当然,这种技术取决于特定的硬件设置。这就是为什么许多Amiga游戏无法适应更快的cpu的主要原因:指令的计时错误。

几乎任何时候编译器看到浮点代码,如果你使用的是旧的糟糕的编译器,手写的版本会更快。(2019年更新:对于现代编译器来说,这并不普遍。特别是在编译x87以外的东西时;编译器更容易使用SSE2或AVX进行标量数学运算,或任何具有平面FP寄存器集的非x86,不像x87的寄存器堆栈。)

主要原因是编译器不能执行任何健壮的优化。关于这个主题的讨论,请参阅来自MSDN的这篇文章。下面是一个例子,其中汇编版本的速度是C版本的两倍(用VS2K5编译):

#include "stdafx.h"
#include <windows.h>

float KahanSum(const float *data, int n)
{
   float sum = 0.0f, C = 0.0f, Y, T;

   for (int i = 0 ; i < n ; ++i) {
      Y = *data++ - C;
      T = sum + Y;
      C = T - sum - Y;
      sum = T;
   }

   return sum;
}

float AsmSum(const float *data, int n)
{
  float result = 0.0f;

  _asm
  {
    mov esi,data
    mov ecx,n
    fldz
    fldz
l1:
    fsubr [esi]
    add esi,4
    fld st(0)
    fadd st(0),st(2)
    fld st(0)
    fsub st(0),st(3)
    fsub st(0),st(2)
    fstp st(2)
    fstp st(2)
    loop l1
    fstp result
    fstp result
  }

  return result;
}

int main (int, char **)
{
  int count = 1000000;

  float *source = new float [count];

  for (int i = 0 ; i < count ; ++i) {
    source [i] = static_cast <float> (rand ()) / static_cast <float> (RAND_MAX);
  }

  LARGE_INTEGER start, mid, end;

  float sum1 = 0.0f, sum2 = 0.0f;

  QueryPerformanceCounter (&start);

  sum1 = KahanSum (source, count);

  QueryPerformanceCounter (&mid);

  sum2 = AsmSum (source, count);

  QueryPerformanceCounter (&end);

  cout << "  C code: " << sum1 << " in " << (mid.QuadPart - start.QuadPart) << endl;
  cout << "asm code: " << sum2 << " in " << (end.QuadPart - mid.QuadPart) << endl;

  return 0;
}

和一些数字从我的PC运行默认版本*:

  C code: 500137 in 103884668
asm code: 500137 in 52129147

出于兴趣,我用dec/jnz交换了循环,它对计时没有影响——有时更快,有时更慢。我想内存有限的方面使其他优化相形见绌。(编者注:更可能的情况是,FP延迟瓶颈足以隐藏循环的额外成本。对奇数/偶数元素并行进行两个Kahan求和,并在最后添加它们,可能会加快2倍的速度。)

哎呀,我正在运行一个稍微不同的代码版本,它输出的数字是错误的(即C更快!)修正并更新了结果。

我已经阅读了所有的答案(超过30个),并没有找到一个简单的原因:如果你读过并练习过Intel®64和IA-32架构优化参考手册,汇编程序比C更快,所以汇编程序可能更慢的原因是编写这种慢汇编程序的人没有阅读优化手册。

In the good old days of Intel 80286, each instruction was executed at a fixed count of CPU cycles. Still, since Pentium Pro, released in 1995, Intel processors became superscalar, utilizing Complex Pipelining: Out-of-Order Execution & Register Renaming. Before that, on Pentium, produced in 1993, there were U and V pipelines. Therefore, Pentium introduced dual pipelines that could execute two simple instructions at one clock cycle if they didn't depend on one another. However, this was nothing compared with the Out-of-Order Execution & Register Renaming that appeared in Pentium Pro. This approach introduced in Pentium Pro is practically the same nowadays on most recent Intel processors.

Let me explain the Out-of-Order Execution in a few words. The fastest code is where instructions do not depend on previous results, e.g., you should always clear whole registers (by movzx) to remove dependency from previous values of the registers you are working with, so they may be renamed internally by the CPU to allow instruction execute in parallel or in a different order. Or, on some processors, false dependency may exist that may also slow things down, like false dependency on Pentium 4 for inc/dec, so you may wish to use add eax, 1 instead or inc eax to remove dependency on the previous state of the flags.

如果时间允许,您可以阅读更多无序执行和注册重命名。因特网上有大量的信息。

There are also many other essential issues like branch prediction, number of load and store units, number of gates that execute micro-ops, memory cache coherence protocols, etc., but the crucial thing to consider is the Out-of-Order Execution. Most people are simply not aware of the Out-of-Order Execution. Therefore, they write their assembly programs like for 80286, expecting their instructions will take a fixed time to execute regardless of the context. At the same time, C compilers are aware of the Out-of-Order Execution and generate the code correctly. That's why the code of such uninformed people is slower, but if you become knowledgeable, your code will be faster.

除了乱序执行之外,还有很多优化技巧和技巧。请阅读上面提到的优化手册:-)

However, assembly language has its own drawbacks when it comes to optimization. According to Peter Cordes (see the comment below), some of the optimizations compilers do would be unmaintainable for large code-bases in hand-written assembly. For example, suppose you write in assembly. In that case, you need to completely change an inline function (an assembly macro) when it inlines into a function that calls it with some arguments being constants. At the same time, a C compiler makes its job a lot simpler—and inlining the same code in different ways into different call sites. There is a limit to what you can do with assembly macros. So to get the same benefit, you'd have to manually optimize the same logic in each place to match the constants and available registers you have.

尽管C语言“接近”于对8位、16位、32位和64位数据的低级操作,但仍有一些C语言不支持的数学操作通常可以在某些汇编指令集中优雅地执行:

Fixed-point multiplication: The product of two 16-bit numbers is a 32-bit number. But the rules in C says that the product of two 16-bit numbers is a 16-bit number, and the product of two 32-bit numbers is a 32-bit number -- the bottom half in both cases. If you want the top half of a 16x16 multiply or a 32x32 multiply, you have to play games with the compiler. The general method is to cast to a larger-than-necessary bit width, multiply, shift down, and cast back: int16_t x, y; // int16_t is a typedef for "short" // set x and y to something int16_t prod = (int16_t)(((int32_t)x*y)>>16);` In this case the compiler may be smart enough to know that you're really just trying to get the top half of a 16x16 multiply and do the right thing with the machine's native 16x16multiply. Or it may be stupid and require a library call to do the 32x32 multiply that's way overkill because you only need 16 bits of the product -- but the C standard doesn't give you any way to express yourself. Certain bitshifting operations (rotation/carries): // 256-bit array shifted right in its entirety: uint8_t x[32]; for (int i = 32; --i > 0; ) { x[i] = (x[i] >> 1) | (x[i-1] << 7); } x[0] >>= 1; This is not too inelegant in C, but again, unless the compiler is smart enough to realize what you are doing, it's going to do a lot of "unnecessary" work. Many assembly instruction sets allow you to rotate or shift left/right with the result in the carry register, so you could accomplish the above in 34 instructions: load a pointer to the beginning of the array, clear the carry, and perform 32 8-bit right-shifts, using auto-increment on the pointer. For another example, there are linear feedback shift registers (LFSR) that are elegantly performed in assembly: Take a chunk of N bits (8, 16, 32, 64, 128, etc), shift the whole thing right by 1 (see above algorithm), then if the resulting carry is 1 then you XOR in a bit pattern that represents the polynomial.

尽管如此,除非有严重的性能限制,否则我不会求助于这些技术。正如其他人所说,汇编代码比C代码更难记录/调试/测试/维护:性能的提高伴随着一些严重的代价。

编辑:3。溢出检测在汇编中是可能的(在C中不能真正做到),这使得一些算法更容易。