为什么用于测试Collatz猜想的c++代码比手写的程序集运行得更快?

我为Project Euler Q14编写了这两个解决方案，用汇编和c++。他们采用了相同的蛮力方法来测试Collatz猜想。装配方案由:

nasm -felf64 p14.asm && gcc p14.o -o p14

c++是用以下工具编译的:

g++ p14.cpp -o p14

组装、p14.asm:

section .data
    fmt db "%d", 10, 0

global main
extern printf

section .text

main:
    mov rcx, 1000000
    xor rdi, rdi        ; max i
    xor rsi, rsi        ; i

l1:
    dec rcx
    xor r10, r10        ; count
    mov rax, rcx

l2:
    test rax, 1
    jpe even

    mov rbx, 3
    mul rbx
    inc rax
    jmp c1

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

c1:
    inc r10
    cmp rax, 1
    jne l2

    cmp rdi, r10
    cmovl rdi, r10
    cmovl rsi, rcx

    cmp rcx, 2
    jne l1

    mov rdi, fmt
    xor rax, rax
    call printf
    ret

c++, p14.cpp:

#include <iostream>

int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = 3*n + 1;
        ++count;
    }
    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }
    std::cout << maxi << std::endl;
}

我知道编译器优化可以提高速度，但是我没有看到很多方法来进一步优化我的汇编解决方案(从编程的角度说，而不是从数学角度说)。

c++代码每一项使用模数，每一项使用除法，而汇编代码每一项只使用一个除法。

但是这个程序集比c++解决方案平均要长1秒。为什么会这样?我问这个问题主要是出于好奇。

执行时间

我的系统:1.4 GHz Intel Celeron 2955U上的64位Linux (Haswell微架构)。

g++(未优化):平均1272毫秒。 g++ -O3:平均578毫秒。 Asm (div)(原始):平均2650毫秒。 Asm (shr):平均679 ms。 @johnfound asm(与NASM组装):平均501毫秒。 @hidefromkgb asm:平均200毫秒。 @hidefromkgb asm，由@Peter Cordes优化:平均145毫秒。 @Veedrac c++: -O3平均81毫秒，-O0平均305毫秒。

当前回答

c++程序在从源代码生成机器码的过程中被转换为汇编程序。说汇编比c++慢实际上是错误的。此外，生成的二进制代码因编译器而异。因此，一个聪明的c++编译器可能会生成比愚蠢的汇编器代码更优、更有效的二进制代码。

但我认为你的分析方法有一定缺陷。以下是概要分析的一般准则:

确保您的系统处于正常/空闲状态。停止所有已启动或大量使用CPU的正在运行的进程(应用程序)(或在网络上轮询)。您的数据必须更大。您的测试必须运行5-10秒以上。不要仅仅依赖于一个样本。执行测试N次。收集结果并计算结果的平均值或中位数。

2016-11-01 06:26:36

其他回答

即使不考虑汇编，最明显的原因是/= 2可能被优化为>>=1，许多处理器具有非常快的移位操作。但即使处理器没有移位操作，整数除法也比浮点除法快。

编辑:在上面的“整数除法比浮点除法快”语句中，您的里程数可能会有所不同。下面的注释表明，现代处理器已经优先优化fp除法而不是整数除法。因此，如果有人正在寻找这个线程的问题所要求的最可能的加速原因，那么编译器优化/=2为>>=1将是最好的第一个地方。

与此无关的是，如果n是奇数，表达式n*3+1总是偶数。所以不需要检查。您可以将分支更改为

{
   n = (n*3+1) >> 1;
   count += 2;
}

所以整个表述就是

if (n & 1)
{
    n = (n*3 + 1) >> 1;
    count += 2;
}
else
{
    n >>= 1;
    ++count;
}

2016-11-01 21:16:20

但我认为你的分析方法有一定缺陷。以下是概要分析的一般准则:

2016-11-01 06:26:36

如果您认为64位DIV指令是除2的好方法，那么即使使用-O0(编译速度快，无需额外优化，并且在每条C语句之后/之前将/重新加载到内存中，以便调试器可以修改变量)，编译器的asm输出也就不足为奇了。

请参阅Agner Fog的优化汇编指南，了解如何编写高效的asm。他还提供了用于特定cpu的特定细节的指令表和microarch指南。参见x86标签维基获取更多性能链接。

另请参阅关于用手写asm击败编译器的更普遍的问题:内联汇编语言比原生c++代码慢吗?TL:DR:是的，如果你做错了(比如这个问题)。

通常你可以让编译器完成它的工作，特别是如果你试图编写能够高效编译的c++。还可以看到汇编语言比编译语言快吗?其中一个答案链接到这些简洁的幻灯片，展示了各种C编译器如何用很酷的技巧优化一些非常简单的函数。Matt Godbolt的CppCon2017演讲“我的编译器最近为我做了什么?”《打开编译器的盖子》也是类似的。

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

在Intel Haswell上，div r64是36 uops，延迟为32-96个周期，吞吐量为每21-74个周期一个。(加上2个uops来设置RBX和零RDX，但乱序执行可以提前运行它们)。像DIV这样的高计数指令是微编码的，这也会导致前端瓶颈。在这种情况下，延迟是最相关的因素，因为它是循环携带依赖链的一部分。

shr rax, 1做同样的无符号除法:它是1uop，有1c的延迟，每个时钟周期可以运行2。

相比之下，32位除法更快，但与移位相比仍然很糟糕。idiv r32在Haswell上是9 uops, 22-29c延迟，每8-11c吞吐量一个。

从gcc的-O0 asm输出(Godbolt编译器资源管理器)可以看出，它只使用移位指令。clang -O0确实像你想的那样简单地编译，甚至两次使用64位的IDIV。(在优化时，如果编译器使用IDIV的话，当源程序使用相同的操作数进行除法和模量时，编译器会同时使用IDIV的输出)

GCC没有完全幼稚的模式;它总是通过GIMPLE进行转换，这意味着一些“优化”不能被禁用。这包括识别按常数除法和使用移位(2的幂)或定点乘法逆(非2的幂)来避免IDIV(参见上面godbolt链接中的div_by_13)。

gcc -Os(优化大小)确实使用IDIV进行非2的幂除法，不幸的是，即使在乘法逆码仅略大但速度快得多的情况下。

帮助编译器

(这种情况下的总结:使用uint64_t n)

首先，只关注优化后的编译器输出。(o3)。 -O0速度基本上没有意义。

Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place: Writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in @Veedrac's code below.)

This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on @johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.

My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.

手写的asm是优化器的黑盒，因此当内联使输入成为编译时常数时，常量传播将不起作用。其他优化也会受到影响。使用asm前请阅读https://gcc.gnu.org/wiki/DontUseInlineAsm。(并且避免msvc风格的内联asm:输入/输出必须经过内存，这会增加开销。)

在本例中:n具有符号类型，gcc使用SAR/SHR/ADD序列来给出正确的舍入。(对于负输入，IDIV和arithmetic-shift“round”不同，请参阅SAR insn set ref manual entry)。(IDK如果gcc试图证明n不可能是负的，或者什么。Signed-overflow是未定义的行为，所以它应该能够。)

你应该使用uint64_tn，所以它可以SHR。因此，它可以移植到long仅为32位的系统(例如x86-64 Windows)。

顺便说一句，gcc优化的asm输出看起来很不错(使用unsigned long n):它内联到main()的内循环是这样做的:

 # from gcc5.4 -O3  plus my comments

 # edx= count=1
 # rax= uint64_t n

.L9:                   # do{
    lea    rcx, [rax+1+rax*2]   # rcx = 3*n + 1
    mov    rdi, rax
    shr    rdi         # rdi = n>>1;
    test   al, 1       # set flags based on n%2 (aka n&1)
    mov    rax, rcx
    cmove  rax, rdi    # n= (n%2) ? 3*n+1 : n/2;
    add    edx, 1      # ++count;
    cmp    rax, 1
    jne   .L9          #}while(n!=1)

  cmp/branch to update max and maxi, and then do the next n

内环是无分支的，循环携带依赖链的关键路径为:

3组分LEA(3个周期) cmov (Haswell上2个周期，Broadwell上1c或更晚)。

总计:每次迭代5个周期，延迟瓶颈。乱序执行处理与此并行的所有其他事情(理论上:我没有测试性能计数器，看看它是否真的运行在5c/iter)。

cmov的FLAGS输入(由TEST生成)比RAX输入(从LEA->MOV生成)更快，所以它不在关键路径上。

类似地，产生CMOV的RDI输入的MOV->SHR也不在关键路径上，因为它也比LEA快。MOV在IvyBridge和以后有零延迟(在注册重命名时间处理)。(它仍然需要一个uop和管道中的一个槽，所以它不是免费的，只是零延迟)。LEA dep链中额外的MOV是其他cpu瓶颈的一部分。

cmp/jne也不是关键路径的一部分:它不是循环携带的，因为控制依赖是通过分支预测+推测执行来处理的，不像关键路径上的数据依赖。

击败编译器

GCC在这里做得很好。使用inc edx而不是add edx, 1可以节省一个代码字节，因为没有人关心P4及其对部分标志修改指令的错误依赖。

它还可以保存所有的MOV指令，并且TEST: SHR设置CF=移出的位，所以我们可以使用cmovc而不是TEST / cmovz。

 ### Hand-optimized version of what gcc does
.L9:                       #do{
    lea     rcx, [rax+1+rax*2] # rcx = 3*n + 1
    shr     rax, 1         # n>>=1;    CF = n&1 = n%2
    cmovc   rax, rcx       # n= (n&1) ? 3*n+1 : n/2;
    inc     edx            # ++count;
    cmp     rax, 1
    jne     .L9            #}while(n!=1)

看到@johnfound的另一个聪明的技巧的答案:删除CMP通过分支SHR的标志结果，以及使用它为CMOV:零，只有当n是1(或0)开始。(有趣的事实:在Nehalem或更早的情况下，count != 1的SHR会在读取标志结果时导致失速。他们就是这样做的。不过，按1移位的特殊编码是可以的。)

Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency (and Ice Lake with updated microcode). The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See @johnfound's answer, where he has timing results from an AMD CPU.

即使在Haswell上，这个版本也可以帮助避免一些偶尔的延迟，即非关键的uop从关键路径上的一个uop窃取执行端口，将执行延迟1个周期。(这称为资源冲突)。它还保存了一个寄存器，这可能有助于在交错循环中并行执行多个n值(见下文)。

LEA的延迟取决于寻址模式，在英特尔snb系列cpu上。3c用于3个组件([base+idx+const]，它需要两次单独的添加)，但只有1c用于2个或更少的组件(一次添加)。一些cpu(如Core2)甚至可以在一个周期内完成3个组件的LEA，但snb家族没有。更糟糕的是，英特尔snb家族标准化了延迟，所以没有2c的uops，否则3组件LEA就会像推土机一样只有2c。(3组件LEA在AMD上也较慢，只是没有那么多)。

所以lea rcx， [rax + rax*2] / inc rcx只有2c的延迟，比lea rcx， [rax + rax*2 + 1]快，在英特尔snb系列cpu上，如Haswell。在BD上收支平衡，在Core2上更糟。它确实会花费额外的uop，为了节省1℃的延迟，这通常是不值得的，但延迟是这里的主要瓶颈，Haswell有一个足够宽的管道来处理额外的uop吞吐量。

Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)

简单的循环展开不会有帮助;这个循环的瓶颈是在一个循环携带的依赖链的延迟上，而不是在循环开销/吞吐量上。这意味着它可以很好地处理超线程(或任何其他类型的SMT)，因为CPU有大量的时间来交错来自两个线程的指令。这将意味着在main中并行循环，但这很好，因为每个线程只能检查n个值的范围，并产生一对整数作为结果。

在单个线程中手工交织也可能是可行的。也许可以并行计算一对数字的序列，因为每个数字只需要两个寄存器，而且它们都可以更新相同的max / maxi。这将创建更多的指令级并行性。

诀窍在于决定是否要等到所有n个值都达到1时才能获得另一对起始n个值，或者是否为达到结束条件的一个序列爆发并获得一个新的起点，而不触及另一个序列的寄存器。可能最好是让每个链都处理有用的数据，否则就必须有条件地增加它的计数器。

你甚至可以用SSE packs来做这件事——对n还没有达到1的向量元素进行有条件地递增计数器。然后，为了隐藏SIMD条件增量实现的更长的延迟，您需要保留更多的n个值的向量。也许只有价值与256b向量(4x uint64_t)。

我认为让1的检测变得“有粘性”的最佳策略是掩盖你添加的所有1的向量，以增加计数器。因此，在元素中看到1之后，增量向量将有一个0，+=0是一个无操作。

手工矢量化的未经测试的想法

# starting with YMM0 = [ n_d, n_c, n_b, n_a ]  (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1):  increment vector
# ymm5 = all-zeros:  count vector

.inner_loop:
    vpaddq    ymm1, ymm0, xmm0
    vpaddq    ymm1, ymm1, xmm0
    vpaddq    ymm1, ymm1, set1_epi64(1)     # ymm1= 3*n + 1.  Maybe could do this more efficiently?

    vpsllq    ymm3, ymm0, 63                # shift bit 1 to the sign bit

    vpsrlq    ymm0, ymm0, 1                 # n /= 2

    # FP blend between integer insns may cost extra bypass latency, but integer blends don't have 1 bit controlling a whole qword.
    vpblendvpd ymm0, ymm0, ymm1, ymm3       # variable blend controlled by the sign bit of each 64-bit element.  I might have the source operands backwards, I always have to look this up.

    # ymm0 = updated n  in each element.

    vpcmpeqq ymm1, ymm0, set1_epi64(1)
    vpandn   ymm4, ymm1, ymm4         # zero out elements of ymm4 where the compare was true

    vpaddq   ymm5, ymm5, ymm4         # count++ in elements where n has never been == 1

    vptest   ymm4, ymm4
    jnz  .inner_loop
    # Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero

    vextracti128 ymm0, ymm5, 1
    vpmaxq .... crap this doesn't exist
    # Actually just delay doing a horizontal max until the very very end.  But you need some way to record max and maxi.

您可以并且应该使用intrinsic而不是手写asm来实现这一点。

算法/实现改进:

除了使用更有效的asm实现相同的逻辑之外，还要寻找简化逻辑或避免冗余工作的方法。例如，记忆检测序列的共同结尾。或者更好的是，一次查看8个尾随位(gnasher的答案)

@EOF指出tzcnt(或bsf)可以用于在一个步骤中执行多个n/=2迭代。这可能比SIMD向量化更好;没有SSE或AVX指令可以做到。不过，它仍然兼容在不同的整数寄存器中并行执行多个标量ns。

所以循环可能是这样的:

goto loop_entry;  // C++ structured like the asm, for illustration only
do {
   n = n*3 + 1;
  loop_entry:
   shift = _tzcnt_u64(n);
   n >>= shift;
   count += shift;
} while(n != 1);

This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed / power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop / 1c latency).

它还将tzcnt (Haswell上的3c)放在关键路径上，因此它显著延长了循环携带依赖链的总延迟。不过，它确实不需要CMOV，也不需要准备一个存储n>>1的寄存器。@Veedrac的答案通过延迟tzcnt/shift进行多次迭代来克服所有这些问题，这是非常有效的(见下文)。

我们可以安全地互换使用BSF或TZCNT，因为在这一点上n永远不可能为零。TZCNT的机器代码在不支持BMI1的cpu上解码为BSF。(无意义的前缀将被忽略，因此REP BSF将作为BSF运行)。

在支持TZCNT的AMD cpu上，TZCNT的性能要比BSF好得多，因此使用REP BSF是一个好主意，即使您不关心如果输入为零而不是输出时设置ZF。当你使用__builtin_ctzll时，一些编译器会这样做，即使是使用-mno-bmi。

They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false-dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x's assumes no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)

Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for @Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only perf upside is interaction with another uarch limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake where Intel removed the false dep for LZCNT/TZCNT they "un-laminate" indexed addressing modes while POPCNT can still micro-fuse any addr mode.

其他答案对想法/代码的改进:

@hidefromkgb的回答有一个很好的观察，你保证能在3n+1之后做一次右移。你可以更有效地计算它而不仅仅是省略步骤之间的检查。答案中的asm实现是坏的，尽管(它依赖于OF，在计数> 1的SHRD之后是未定义的)，并且很慢:ROR rdi,2比SHRD rdi,rdi,2快，并且在关键路径上使用两个CMOV指令比可以并行运行的额外TEST慢。

我把整理/改进的C(指导编译器产生更好的asm)，和测试+工作更快的asm(在C下面的评论)在Godbolt上:见链接在@hidefromkgb的答案。(这个答案达到了Godbolt大url的30k字符限制，但短链接可能会腐烂，而且对goo来说太长了。gl。)

还改进了输出打印转换为字符串，并使一个write()而不是一次写入一个char。这最大限度地减少了使用perf stat ./collatz(记录性能计数器)对整个程序计时的影响，并且我消除了一些非关键asm的混淆。

@Veedrac的代码

我从右移得到了一个我们知道需要做的小加速，并检查继续循环。在Core2Duo (Merom)上，从limit=1e8的7.5s降至7.275s，展开系数为16。

Godbolt的代码+注释。不要使用这个带有叮当声的版本;它对延迟循环做了一些愚蠢的事情。使用tmp计数器k，然后将其添加到计数中，这会改变clang所做的工作，但这会略微影响gcc。

参见评论中的讨论:Veedrac的代码在BMI1的cpu上非常出色(即不是赛扬/奔腾)

2016-11-01 07:04:26

你没有发布编译器生成的代码，所以这里有一些猜测，但即使没有看到它，人们也可以说:

test rax, 1
jpe even

．.．有50%的几率预测错分支，代价会很高昂。

编译器几乎肯定会进行这两种计算(因为div/mod的延迟相当长，所以乘法-加法是“免费的”)，并随后进行CMOV。当然，被错误预测的可能性为零。

2016-11-01 19:50:03

声称c++编译器可以产生比合格的汇编语言程序员更优的代码是一个非常严重的错误。尤其是在这种情况下。人类总是能比编译器编写出更好的代码，这种特殊情况很好地说明了这一点。

您看到的时间差异是因为问题中的汇编代码在内部循环中远远不是最优的。

(下面的代码是32位的，但可以很容易地转换为64位)

例如，序列函数可以优化到只有5条指令:

    .seq:
        inc     esi                 ; counter
        lea     edx, [3*eax+1]      ; edx = 3*n+1
        shr     eax, 1              ; eax = n/2
        cmovc   eax, edx            ; if CF eax = edx
        jnz     .seq                ; jmp if n<>1

整个代码看起来像这样:

include "%lib%/freshlib.inc"
@BinaryType console, compact
options.DebugMode = 1
include "%lib%/freshlib.asm"

start:
        InitializeAll
        mov ecx, 999999
        xor edi, edi        ; max
        xor ebx, ebx        ; max i

    .main_loop:

        xor     esi, esi
        mov     eax, ecx

    .seq:
        inc     esi                 ; counter
        lea     edx, [3*eax+1]      ; edx = 3*n+1
        shr     eax, 1              ; eax = n/2
        cmovc   eax, edx            ; if CF eax = edx
        jnz     .seq                ; jmp if n<>1

        cmp     edi, esi
        cmovb   edi, esi
        cmovb   ebx, ecx

        dec     ecx
        jnz     .main_loop

        OutputValue "Max sequence: ", edi, 10, -1
        OutputValue "Max index: ", ebx, 10, -1

        FinalizeAll
        stdcall TerminateAll, 0

为了编译这段代码，需要使用FreshLib。

在我的测试中，(1 GHz AMD A4-1200处理器)，上面的代码大约比问题中的c++代码快四倍(当用-O0编译时:430毫秒vs. 1900毫秒)，当用-O3编译时，快两倍多(430毫秒vs. 830毫秒)。

两个程序的输出是相同的:在i = 837799上，max sequence = 525。

2016-11-01 08:29:38

为什么用于测试Collatz猜想的c++代码比手写的程序集运行得更快?

推荐文章

最新文章

标签