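One of the stated reasons for knowing assembler is that, on occasion, it can be used to write code that performs better than the same code written in a higher-level language, C in particular. However, I've also heard it stated many times that, although this is not entirely false, the cases where assembler can actually be used to generate more performant code are extremely rare and require expert knowledge of and experience with assembly.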

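This question doesn't even get into the fact that assembler instructions are machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.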

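Can anyone provide specific examples of cases where assembly is faster than well-written C code compiled with a modern compiler, and can you support that claim with profiling evidence? I am fairly confident these cases exist, but I really want to know exactly how esoteric they are, since this seems to be a point of some contention.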


Current answer

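Pretty much any time the compiler sees floating-point code, a hand-written version will be quicker if you're using an old, bad compiler. (2019 update: this is not true in general for modern compilers, especially when compiling for anything other than x87. Compilers have an easier time with SSE2 or AVX for scalar math, or with any non-x86 target that has a flat FP register set, unlike x87's register stack.)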

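The primary reason is that the compiler can't perform any robust optimisations. See this article from MSDN for a discussion of the subject. Here's an example where the assembly version is twice as fast as the C version (compiled with VS2K5):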

#include "stdafx.h"
#include <windows.h>
#include <cstdlib>   // rand, RAND_MAX
#include <iostream>  // cout, endl

using namespace std;

float KahanSum(const float *data, int n)
{
   float sum = 0.0f, C = 0.0f, Y, T;

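   for (int i = 0 ; i < n ; ++i) {
      Y = *data++ - C;   // next input, corrected by the error carried so far
      T = sum + Y;       // new running total; low-order bits of Y may be lost
      C = T - sum - Y;   // recover what was lost, evaluated as (T - sum) - Y
      sum = T;
   }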

   return sum;
}

float AsmSum(const float *data, int n)
{
  float result = 0.0f;

  _asm
  {
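    mov esi,data          ; esi = pointer to the next input element
    mov ecx,n             ; ecx = loop counter
    fldz                  ; x87 stack: sum
    fldz                  ; x87 stack: C, sum
l1:
    fsubr [esi]           ; Y = *data - C            -> Y, sum
    add esi,4             ; advance to the next float
    fld st(0)             ;                          -> Y, Y, sum
    fadd st(0),st(2)      ; T = Y + sum              -> T, Y, sum
    fld st(0)             ;                          -> T, T, Y, sum
    fsub st(0),st(3)      ; T - sum                  -> T-sum, T, Y, sum
    fsub st(0),st(2)      ; C = T - sum - Y          -> C, T, Y, sum
    fstp st(2)            ; overwrite Y with C, pop  -> T, C, sum
    fstp st(2)            ; overwrite sum with T, pop -> C, sum
    loop l1               ; decrement ecx, repeat while ecx != 0
    fstp result           ; pop C (overwritten by the next store)
    fstp result           ; pop sum into result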
  }

  return result;
}

int main (int, char **)
{
  int count = 1000000;

  float *source = new float [count];

  for (int i = 0 ; i < count ; ++i) {
    source [i] = static_cast <float> (rand ()) / static_cast <float> (RAND_MAX);
  }

  LARGE_INTEGER start, mid, end;

  float sum1 = 0.0f, sum2 = 0.0f;

  QueryPerformanceCounter (&start);

  sum1 = KahanSum (source, count);

  QueryPerformanceCounter (&mid);

  sum2 = AsmSum (source, count);

  QueryPerformanceCounter (&end);

  cout << "  C code: " << sum1 << " in " << (mid.QuadPart - start.QuadPart) << endl;
  cout << "asm code: " << sum2 << " in " << (end.QuadPart - mid.QuadPart) << endl;

  delete [] source;

  return 0;
}

And some numbers from my PC running a default release build*:

  C code: 500137 in 103884668
asm code: 500137 in 52129147

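Out of interest, I swapped the loop instruction for dec/jnz and it made no difference to the timings: sometimes quicker, sometimes slower. I guess the memory-limited aspect dwarfs other optimisations. (Editor's note: more likely, the FP latency bottleneck is enough to hide the extra cost of loop. Doing two Kahan summations in parallel for the odd and even elements, and adding them at the end, could perhaps speed this up by a factor of 2.)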

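Whoops, I was running a slightly different version of the code and it output the numbers the wrong way round (i.e. C was faster!). Fixed and updated the results.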
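The editor's note about running two Kahan summations in parallel can be sketched in plain C. This is only an illustration of the idea, not code from the answer: the function name kahan_sum_2way is made up here, and any speedup is untested and depends on the compiler and CPU. It also assumes the code is built without options such as -ffast-math, which allow the compiler to optimise the compensation away.

```c
#include <stddef.h>

/* Kahan summation over data[0..n) with two independent accumulators,
   one for even indices and one for odd indices. The two dependency
   chains do not feed each other, so the CPU can overlap them.
   (Hypothetical helper, not part of the original answer.) */
float kahan_sum_2way(const float *data, size_t n)
{
    float sum0 = 0.0f, c0 = 0.0f;   /* running sum and compensation, even elements */
    float sum1 = 0.0f, c1 = 0.0f;   /* running sum and compensation, odd elements */
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        float y0 = data[i] - c0;
        float t0 = sum0 + y0;
        c0 = (t0 - sum0) - y0;
        sum0 = t0;

        float y1 = data[i + 1] - c1;
        float t1 = sum1 + y1;
        c1 = (t1 - sum1) - y1;
        sum1 = t1;
    }

    if (i < n) {                    /* leftover element when n is odd */
        float y0 = data[i] - c0;
        float t0 = sum0 + y0;
        c0 = (t0 - sum0) - y0;
        sum0 = t0;
    }

    return sum0 + sum1;             /* combine the two partial sums */
}
```

Because the elements are added in a different order, the result may differ from the single-accumulator version in the last bits.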

Other answers

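One of the possibilities with the CP/M-86 version of PolyPascal (sibling to Turbo Pascal) was to replace the "use the BIOS to output characters to the screen" facility with a machine-language routine which, in essence, was given x, y and the string to put there.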

This made it possible to update the screen much, much faster than before!

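There was room in the binary to embed machine code (a few hundred bytes), but there was other stuff there too, so it was essential to squeeze the code as much as possible.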

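It turned out that since the screen was 80x25, each coordinate could fit in a byte, so both could fit in one two-byte word. This allowed the needed calculations to be done in fewer bytes, since a single add could manipulate both values simultaneously.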

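To my knowledge there are no C compilers that can merge multiple values into one register, perform SIMD-style instructions on them and split them out again later (and I don't think the machine instructions would be shorter anyway).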
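The packed-coordinate trick described above can be written out by hand in C, as a rough sketch. The byte order (row in the high byte, column in the low byte) and all function names are assumptions made for illustration; the answer does not say how the original routine laid the word out.

```c
#include <stdint.h>

/* Pack a column (0..79) and a row (0..24) into one 16-bit word.
   Assumed layout: row in the high byte, column in the low byte. */
uint16_t pack_xy(uint8_t x, uint8_t y)
{
    return (uint16_t)(((unsigned)y << 8) | x);
}

uint8_t unpack_x(uint16_t pos) { return (uint8_t)(pos & 0xFF); }
uint8_t unpack_y(uint16_t pos) { return (uint8_t)(pos >> 8); }

/* Move both coordinates with a single 16-bit addition. This is only
   valid while neither byte overflows: a carry out of the low byte
   would spill into the row. On an 80x25 screen there is ample headroom. */
uint16_t move_xy(uint16_t pos, uint8_t dx, uint8_t dy)
{
    return (uint16_t)(pos + pack_xy(dx, dy));
}
```

For example, move_xy(pack_xy(10, 5), 3, 2) yields a word that unpacks to column 13, row 7.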

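Many years ago I was teaching someone to program in C. The exercise was to rotate a graphic through 90 degrees. He came back with a solution that took several minutes to complete, mainly because he was using multiplies and divides etc.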

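I showed him how to recast the problem using bit shifts, and the processing time came down to about 30 seconds on the non-optimizing compiler he had.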

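I had just got an optimizing compiler, and the same code rotated the graphic in < 5 seconds. I looked at the assembly code the compiler was generating, and from what I saw I decided there and then that my days of writing assembler were over.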
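The answer does not show the rotation code, so the following is only one plausible reading of "recasting the problem using bit shifts". It assumes a square image with one byte per pixel and a side length that is a power of two, so that row offsets can be computed with a shift instead of a multiplication. The function name and parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a square 8-bit image 90 degrees clockwise.
   The side length is 1 << log2_side, so "row * side" becomes
   "row << log2_side". (Assumed layout: row-major, one byte per pixel.) */
void rotate90_cw(const uint8_t *src, uint8_t *dst, unsigned log2_side)
{
    size_t side = (size_t)1 << log2_side;

    for (size_t y = 0; y < side; ++y) {
        for (size_t x = 0; x < side; ++x) {
            /* source pixel (row y, column x) lands at (row x, column side-1-y) */
            dst[(x << log2_side) + (side - 1 - y)] = src[(y << log2_side) + x];
        }
    }
}
```

An optimizing compiler will usually perform this kind of strength reduction on its own when the side length is a known power of two, which fits the experience described above.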

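I have read all the answers (more than 30) and didn't find a simple reason: assembler is faster than C if you have read and practiced the Intel® 64 and IA-32 Architectures Optimization Reference Manual. So the reason assembly may be slower is that the people who write such slow assembly didn't read the Optimization Manual.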

In the good old days of the Intel 80286, each instruction executed in a fixed number of CPU cycles. The Pentium, produced in 1993, was already superscalar: its dual U and V pipelines could execute two simple instructions in one clock cycle if they didn't depend on one another. However, this was nothing compared with the Out-of-Order Execution & Register Renaming that appeared in the Pentium Pro, released in 1995. The approach introduced in the Pentium Pro is practically the same one used in the most recent Intel processors.

Let me explain Out-of-Order Execution in a few words. The fastest code is code whose instructions do not depend on previous results. For example, you should always write whole registers (e.g. with movzx) to remove the dependency on the previous values of the registers you are working with, so that the CPU can rename them internally and execute instructions in parallel or in a different order. On some processors a false dependency can also slow things down, such as the false dependency on the flags for inc/dec on the Pentium 4, so you may wish to use add eax, 1 instead of inc eax to remove the dependency on the previous state of the flags.

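You can read more about Out-of-Order Execution & Register Renaming if time permits. There is plenty of information available on the Internet.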

There are also many other essential issues, like branch prediction, the number of load and store units, the number of execution ports that run micro-ops, memory cache coherence protocols, etc., but the crucial thing to consider is Out-of-Order Execution. Most people are simply not aware of it. Therefore, they write their assembly programs as if for the 80286, expecting their instructions to take a fixed time to execute regardless of context. C compilers, on the other hand, are aware of Out-of-Order Execution and generate code accordingly. That's why the code of such uninformed people is slower, but if you become knowledgeable, your code will be faster.

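There are also lots of optimization tips and tricks besides Out-of-Order Execution. Just read the Optimization Manual mentioned above :-)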

However, assembly language has its own drawbacks when it comes to optimization. According to Peter Cordes (see the comment below), some of the optimizations compilers do would be unmaintainable for large code-bases in hand-written assembly. For example, if you write in assembly, you need to completely change an inline function (an assembly macro) when it inlines into a function that calls it with some arguments being constants. A C compiler makes this job a lot simpler, inlining the same code in different ways into different call sites. There is a limit to what you can do with assembly macros, so to get the same benefit you'd have to manually optimize the same logic in each place to match the constants and available registers you have.
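The point about inlining with constant arguments can be illustrated with a small C sketch. The helper scale and its three callers are hypothetical names invented for this example; the comments describe what an optimizing compiler typically does after inlining, which is not guaranteed for any particular compiler or optimization level.

```c
#include <stdint.h>

/* Hypothetical generic helper: fixed-point scaling, (value * num) / den. */
static inline uint32_t scale(uint32_t value, uint32_t num, uint32_t den)
{
    return (uint32_t)(((uint64_t)value * num) / den);
}

/* Call site 1: both factors are compile-time constants. After inlining,
   the multiply by 1 disappears and the divide by 2 can become a shift. */
uint32_t half(uint32_t value)
{
    return scale(value, 1, 2);
}

/* Call site 2: only the divisor is constant (a power of two), so the
   division can become a shift while the multiply stays. */
uint32_t to_q8(uint32_t value, uint32_t num)
{
    return scale(value, num, 256);
}

/* Call site 3: nothing is constant, so the general multiply and
   divide remain. */
uint32_t scale_any(uint32_t value, uint32_t num, uint32_t den)
{
    return scale(value, num, den);
}
```

In hand-written assembly, each of these three call sites would need its own manually specialised version of the same logic.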

It all depends on your workload.

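For day-to-day operations, C and C++ are just fine, but there are certain workloads (any transforms involving video: compression, decompression, image effects, etc.) that pretty much require assembly to be performant.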

They also usually involve using CPU-specific chipset extensions (MME/MMX/SSE/whatever) that are tuned for those kinds of operations.
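As a small illustration of the kind of operation these extensions are tuned for, here is a sketch of an image "brighten" effect using SSE2 intrinsics from C rather than hand-written assembly. The function name brighten is made up for this example, and the scalar fallback is an assumption added so that the code also builds on targets without SSE2.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

/* Brighten 8-bit pixels in place by `amount`, clamping at 255.
   (Hypothetical example function, not taken from the answer.) */
void brighten(uint8_t *pixels, size_t n, uint8_t amount)
{
    size_t i = 0;

#ifdef HAVE_SSE2
    const __m128i add = _mm_set1_epi8((char)amount);
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(pixels + i));
        v = _mm_adds_epu8(v, add);   /* 16 unsigned saturating adds at once */
        _mm_storeu_si128((__m128i *)(pixels + i), v);
    }
#endif

    /* scalar tail, or the whole array when SSE2 is unavailable */
    for (; i < n; ++i) {
        unsigned s = (unsigned)pixels[i] + amount;
        pixels[i] = (uint8_t)(s > 255 ? 255 : s);
    }
}
```

The saturating add is a single instruction here, whereas the scalar loop needs an add, a compare and a select per pixel.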

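Point one, which is not the answer: even if you never program in it, I find it useful to know at least one assembler instruction set. This is part of the programmer's never-ending quest to know more and therefore be better. It is also useful when stepping into frameworks you don't have the source code for, so that you have at least a rough idea of what is going on. It also helps you understand Java bytecode and .NET IL, as both are similar to assembler.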

To answer the question: when you have a small amount of code or a large amount of time. It is most useful in embedded chips, where low chip complexity and poor competition among compilers targeting these chips can tip the balance in favour of humans. Also, for restricted devices you are often trading off code size, memory size and performance in a way that would be hard to instruct a compiler to do. E.g. I know this user action is not called often, so I will accept small code size and poor performance, but this other function that looks similar is used every second, so I will accept a larger code size and faster performance. That is the sort of trade-off a skilled assembly programmer can make.

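I would also like to add that there is a lot of middle ground where you can code in C, compile and examine the assembly produced, and then either change your C code or tweak the output and maintain it as assembly.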

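My friend works on microcontrollers, currently chips for controlling small electric motors. He works in a combination of low-level C and assembly. He once told me of a good day at work where he reduced the main loop from 48 instructions to 43. He is also faced with choices like this: the code has grown to fill the 256k chip and the business wants a new feature, so do you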

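- Remove an existing feature.
- Reduce the size of some or all of the existing features, maybe at the cost of performance.
- Advocate moving to a larger chip with a higher cost, higher power consumption and a larger form factor.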

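I would like to add that, as a commercial developer with quite a portfolio of languages, platforms and types of applications, I have never once felt the need to dive into writing assembly. I have, however, always appreciated the knowledge I gained about it, and I have sometimes debugged into it.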

I know I have mostly answered the question "why should I learn assembler", but I feel that is a more important question than when it is faster.

So let's try once more. You should be thinking about assembly when

- working on low-level operating system functions;
- working on a compiler;
- working on an extremely limited chip, embedded system, etc.

Remember to compare your assembly with what the compiler generates to see which is faster/smaller/better.

David.