了解汇编程序的原因之一是,有时可以使用汇编程序来编写比用高级语言(特别是C语言)编写的代码性能更好的代码。然而,我也听人说过很多次,尽管这并非完全错误,但实际上可以使用汇编程序来生成性能更好的代码的情况极其罕见,并且需要汇编方面的专业知识和经验。

这个问题甚至没有涉及到这样一个事实,即汇编程序指令将是特定于机器的、不可移植的,或者汇编程序的任何其他方面。当然,除了这一点之外,了解汇编还有很多很好的理由,但这是一个需要示例和数据的具体问题,而不是关于汇编程序与高级语言的扩展论述。

谁能提供一些具体的例子,说明使用现代编译器汇编代码比编写良好的C代码更快,并且您能否用分析证据支持这一说法?我相信这些案例确实存在,但我真的很想知道这些案例到底有多深奥,因为这似乎是一个有争议的问题。


当前回答

在我的工作中,有三个原因让我了解和使用组装。按重要性排序:

Debugging - I often get library code that has bugs or incomplete documentation. I figure out what it's doing by stepping in at the assembly level. I have to do this about once a week. I also use it as a tool to debug problems in which my eyes don't spot the idiomatic error in C/C++/C#. Looking at the assembly gets past that. Optimizing - the compiler does fairly well in optimizing, but I play in a different ballpark than most. I write image processing code that usually starts with code that looks like this: for (int y=0; y < imageHeight; y++) { for (int x=0; x < imageWidth; x++) { // do something } } the "do something part" typically happens on the order of several million times (ie, between 3 and 30). By scraping cycles in that "do something" phase, the performance gains are hugely magnified. I don't usually start there - I usually start by writing the code to work first, then do my best to refactor the C to be naturally better (better algorithm, less load in the loop etc). I usually need to read assembly to see what's going on and rarely need to write it. I do this maybe every two or three months. doing something the language won't let me. These include - getting the processor architecture and specific processor features, accessing flags not in the CPU (man, I really wish C gave you access to the carry flag), etc. I do this maybe once a year or two years.

其他回答

以下是我个人经历中的几个例子:

Access to instructions that are not accessible from C. For instance, many architectures (like x86-64, IA-64, DEC Alpha, and 64-bit MIPS or PowerPC) support a 64 bit by 64 bit multiplication producing a 128 bit result. GCC recently added an extension providing access to such instructions, but before that assembly was required. And access to this instruction can make a huge difference on 64-bit CPUs when implementing something like RSA - sometimes as much as a factor of 4 improvement in performance. Access to CPU-specific flags. The one that has bitten me a lot is the carry flag; when doing a multiple-precision addition, if you don't have access to the CPU carry bit one must instead compare the result to see if it overflowed, which takes 3-5 more instructions per limb; and worse, which are quite serial in terms of data accesses, which kills performance on modern superscalar processors. When processing thousands of such integers in a row, being able to use addc is a huge win (there are superscalar issues with contention on the carry bit as well, but modern CPUs deal pretty well with it). SIMD. Even autovectorizing compilers can only do relatively simple cases, so if you want good SIMD performance it's unfortunately often necessary to write the code directly. Of course you can use intrinsics instead of assembly but once you're at the intrinsics level you're basically writing assembly anyway, just using the compiler as a register allocator and (nominally) instruction scheduler. (I tend to use intrinsics for SIMD simply because the compiler can generate the function prologues and whatnot for me so I can use the same code on Linux, OS X, and Windows without having to deal with ABI issues like function calling conventions, but other than that the SSE intrinsics really aren't very nice - the Altivec ones seem better though I don't have much experience with them). As examples of things a (current day) vectorizing compiler can't figure out, read about bitslicing AES or SIMD error correction - one could imagine a compiler that could analyze algorithms and generate such code, but it feels to me like such a smart compiler is at least 30 years away from existing (at best).

On the other hand, multicore machines and distributed systems have shifted many of the biggest performance wins in the other direction - get an extra 20% speedup writing your inner loops in assembly, or 300% by running them across multiple cores, or 10000% by running them across a cluster of machines. And of course high level optimizations (things like futures, memoization, etc) are often much easier to do in a higher level language like ML or Scala than C or asm, and often can provide a much bigger performance win. So, as always, there are tradeoffs to be made.

这很难具体地回答,因为这个问题非常不具体:到底什么是“现代编译器”?

理论上,几乎任何手动的汇编器优化都可以由编译器来完成——实际上它是否已经完成,不能笼统地说,只能说特定编译器的特定版本。许多可能需要花费大量的精力来确定它们是否可以在特定的上下文中应用而不产生副作用,以至于编译器编写者不会为它们烦恼。

几乎任何时候编译器看到浮点代码,如果你使用的是旧的糟糕的编译器,手写的版本会更快。(2019年更新:对于现代编译器来说,这并不普遍。特别是在编译x87以外的东西时;编译器更容易使用SSE2或AVX进行标量数学运算,或任何具有平面FP寄存器集的非x86,不像x87的寄存器堆栈。)

主要原因是编译器不能执行任何健壮的优化。关于这个主题的讨论,请参阅来自MSDN的这篇文章。下面是一个例子,其中汇编版本的速度是C版本的两倍(用VS2K5编译):

#include "stdafx.h"
#include <windows.h>

float KahanSum(const float *data, int n)
{
   float sum = 0.0f, C = 0.0f, Y, T;

   for (int i = 0 ; i < n ; ++i) {
      Y = *data++ - C;
      T = sum + Y;
      C = T - sum - Y;
      sum = T;
   }

   return sum;
}

float AsmSum(const float *data, int n)
{
  float result = 0.0f;

  _asm
  {
    mov esi,data
    mov ecx,n
    fldz
    fldz
l1:
    fsubr [esi]
    add esi,4
    fld st(0)
    fadd st(0),st(2)
    fld st(0)
    fsub st(0),st(3)
    fsub st(0),st(2)
    fstp st(2)
    fstp st(2)
    loop l1
    fstp result
    fstp result
  }

  return result;
}

int main (int, char **)
{
  int count = 1000000;

  float *source = new float [count];

  for (int i = 0 ; i < count ; ++i) {
    source [i] = static_cast <float> (rand ()) / static_cast <float> (RAND_MAX);
  }

  LARGE_INTEGER start, mid, end;

  float sum1 = 0.0f, sum2 = 0.0f;

  QueryPerformanceCounter (&start);

  sum1 = KahanSum (source, count);

  QueryPerformanceCounter (&mid);

  sum2 = AsmSum (source, count);

  QueryPerformanceCounter (&end);

  cout << "  C code: " << sum1 << " in " << (mid.QuadPart - start.QuadPart) << endl;
  cout << "asm code: " << sum2 << " in " << (end.QuadPart - mid.QuadPart) << endl;

  return 0;
}

和一些数字从我的PC运行默认版本*:

  C code: 500137 in 103884668
asm code: 500137 in 52129147

出于兴趣,我用dec/jnz交换了循环,它对计时没有影响——有时更快,有时更慢。我想内存有限的方面使其他优化相形见绌。(编者注:更可能的情况是,FP延迟瓶颈足以隐藏循环的额外成本。对奇数/偶数元素并行进行两个Kahan求和,并在最后添加它们,可能会加快2倍的速度。)

哎呀,我正在运行一个稍微不同的代码版本,它输出的数字是错误的(即C更快!)修正并更新了结果。

这个问题有点毫无意义,因为无论如何c都是编译到汇编程序的。 但是,通过优化编译器生成的汇编程序几乎是完全优化的,所以除非你在优化特定的汇编程序方面做了20个博士学位,否则你无法打败编译器。

只有在使用编译器不支持的特殊用途指令集时。

为了最大限度地利用具有多个管道和预测分支的现代CPU的计算能力,您需要以这样一种方式来构造汇编程序:a)人类几乎不可能编写b)甚至更不可能维护。

此外,更好的算法、数据结构和内存管理将为您提供至少一个数量级的性能,而不是在汇编中进行的微观优化。