This method of finite differences strength-reduction optimization can give a speedup over the best you can do re-evaluating the polynomial separately for each i. But only if you generalize it to a larger stride, to still have enough parallelism in the loop. My version stores one vector (four doubles) per clock cycle on my Skylake, for a small array that fits in L1d cache; otherwise it's a bandwidth test. On earlier Intel, it should also max out SIMD FP-add throughput, including your Sandy Bridge with AVX (1x 256-bit add/clock, and 1x 256-bit store per two clocks, if you align the output.)
Depending on the previous iteration's value is the killer
This strength-reduction optimization (just adding, instead of starting over with a fresh i and multiplying) introduces a serial dependency across loop iterations, involving FP math rather than an integer increment.
The original version has data parallelism across every output element: each one depends only on constants and its own i value. Compilers can auto-vectorize with SIMD (SSE2, or AVX if you use -O3 -march=native), and CPUs can overlap the work across loop iterations with out-of-order execution. Despite the amount of extra work, the CPU is able to apply sufficient brute force, with the compiler's help.
But the version that computes poly(i+1) in terms of poly(i) has very limited parallelism; with no SIMD vectorization, your CPU can only run two scalar adds per 4 cycles, for example, where 4 cycles is the latency of FP addition on Intel Skylake through Tiger Lake. (https://uops.info/)
huseyin tugrul buyukisik's answer shows how you can come close to maxing out the throughput of the original version on a more modern CPU, with two FMA operations to evaluate the polynomial (Horner's scheme), plus an int-to-FP conversion or an FP increment. (The latter creates an FP dependency chain which you need to unroll to hide.)
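For reference, Horner's scheme on this quadratic is just two FMAs per element plus the int-to-double conversion. A minimal scalar sketch of that idea (illustration only, not the code from that answer), using the standard C fma() from <math.h>; with FMA enabled (e.g. -march=haswell or later) compilers turn fma() into hardware FMA instructions, and every element is independent so it can auto-vectorize:

#include <math.h>

// poly(i) = (A*i + B)*i + C  via two fma() calls per element, plus int->double.
// Each output element is independent, so this can auto-vectorize; the cost is
// 3 FP math operations per element instead of the 2 adds in the version below.
void compute_horner(double *data, long n, double A, double B, double C)
{
    for (long i = 0; i < n; i++) {
        double x = (double)i;                // int-to-FP conversion
        data[i] = fma(fma(A, x, B), x, C);   // (A*x + B)*x + C
    }
}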
So best case you have three floating point math operations per SIMD vector of output. (Plus a store). Current Intel CPUs only have two floating point execution units that can run FP math operations including int-to-double. (With 512-bit vectors, current CPUs shut down the vector ALU on port 1 so there are only two SIMD ALU ports at all, so non-FP-math operations, like SIMD-integer increment, will also compete for SIMD throughput. Except for CPUs with only one 512-bit FMA unit, then port 5 is free for other work.)
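Back-of-the-envelope, per 4-double (256-bit) vector of output, on current Intel with 2 FP math ports and 1 store per clock:

Horner + convert:     3 FP-port uops / 2 ports  =>  ~1.5 cycles per vector, best case
finite differences:   2 vaddpd       / 2 ports  =>  ~1.0 cycle  per vector, matching the 1/clock store limit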
AMD since Zen 2 has two FMA/mul units on two ports, and two FP add/sub units on two different ports, so if you do the adds with FMA, you have a theoretical max of four SIMD additions per clock cycle.
Haswell/Broadwell have 2/clock FMA, but only 1/clock FP add/sub (with lower latency). This is good for naive code, but not great for code that has been optimized to have lots of parallelism. That's probably why Intel changed it in Skylake. (Alder Lake reintroduced lower-latency FP add/sub, but with 2/clock throughput, same as multiply. Interestingly, non-uniform latency: only 2 cycles from the destination operand, 3 cycles from the other operand, so it's great for longer dependency chains.)
Your Sandy Bridge (E5-1620) and Nehalem (W5580) CPUs have 1/clock FP add/sub and 1/clock FP mul, on separate ports. This is what Haswell was building on. And it's why adding the extra multiplies isn't a big problem: they can run in parallel with the existing adds. (Sandy Bridge's are 256 bits wide, but you compiled without AVX enabled: use -march=native.)
Finding parallelism: strength-reduction with an arbitrary stride
Your compute2 calculates the next Y and the next Z in terms of the immediately previous values, i.e. with a stride of 1, the values you need for data[i+1]. So each iteration depends on the one before.
If you generalize that to other strides, you can advance 4, 6, 8, or more separate Y and Z values so they all leapfrog in lockstep, each independently of the others. This regains enough parallelism for the compiler and/or CPU to take advantage of.
poly(i)   = A i^2              + B i          + C
poly(i+s) = A (i+s)^2          + B (i+s)      + C
          = A*i^2 + A*2*s*i + A*s^2  +  B*i + B*s  +  C
          = poly(i) + A*2*s*i + A*s^2 + B*s
So that's a bit of a mess, and it's not exactly clear how to break it down into Y and Z parts. (An earlier version of this answer got that wrong.)
It's probably easier to work backwards from the 1st-order and 2nd-order differences through the sequence of FP values (method of finite differences). That directly tells us what we need to add to keep going: the Z[] initializer and the step size.
This is basically like taking the 1st and 2nd derivative, with the optimized loop then effectively integrating to recover the original function. The following outputs are generated by the correctness-check part of main in the benchmark below.
# method of differences for stride=1, A=1, B=0, C=0
poly(i)   1st     2nd       difference from this poly(i) to poly(i+1)
 0         1
 1         3       2        # 4-1 = 3   | 3-1 = 2
 4         5       2        # 9-4 = 5   | 5-3 = 2
 9         7       2        # ...
16         9       2
25        11       2
The same polynomial (x^2), but with a stride of 3. A non-power-of-2 helps show where factors/powers of the stride come from, vs. naturally-occurring factors of 2.
# for stride of 3, printing in groups. A=1, B=0, C=0
poly(i)   1st     2nd       difference from this poly(i) to poly(i+3)
  0        9
  1       15
  4       21
  9       27      18        # 36- 9 = 27 | 27-9  = 18
 16       33      18        # 49-16 = 33 | 33-15 = 18
 25       39      18        # ...
 36       45      18        # 81-36 = 45 | 45-27 = 18
 49       51      18
 64       57      18
 81       63      18
100       69      18
121       75      18
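A table like the ones above only takes a few lines of C to produce; a minimal sketch (not the benchmark's actual correctness-check code):

#include <stdio.h>

static double poly(double A, double B, double C, long i) { return A*i*i + B*i + C; }

int main(void) {
    const double A = 1, B = 0, C = 0;
    const long stride = 3;
    for (long i = 0; i < 12; i++) {
        double first  = poly(A,B,C, i+stride) - poly(A,B,C, i);   // 1st difference
        double second = poly(A,B,C, i+2*stride) - 2*poly(A,B,C, i+stride) + poly(A,B,C, i);
        // last column is constant: 2 * A * stride^2
        printf("%6.0f %6.0f %6.0f\n", poly(A,B,C, i), first, second);
    }
    return 0;
}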
Y[] and Z[] initialization
The initial Y[j] = poly(j) because it has to get stored to the output at the corresponding position (data[i+j] = Y[j]).
The initial Z[j] will get added to Y[j], and needs to make it into poly(j+stride). Thus the initial Z[j] = poly(j+stride) - Y[j], which we can then simplify algebraically if we want. (For compile-time constant A,B,C, the compiler will constant-propagate either way.)
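Working through that simplification (this is where the commented-out alternatives in the code below come from):

Z[j] = poly(j+stride) - poly(j)
     = A*(j+stride)^2 + B*(j+stride) + C  -  (A*j^2 + B*j + C)
     = A*(2*j*stride + stride^2) + B*stride
     = ((2*j + stride)*A + B) * stride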
Z[j] holds the first-order differences in striding through poly(x), for starting points of poly(0..stride-1). This is the middle column in the above table.
The necessary update to Z[j] += second_difference is a scalar constant, as we can see from the second-order differences being the same.
By playing around with a couple of different stride and A values (coefficient of i^2), we can see that it's A * 2 * (stride * stride). (Using coprime values like 3 and 5 helps to disentangle things.) With more algebra, you could show this symbolically. The factor of 2 makes sense from a calculus PoV: d(A*x^2)/dx = 2Ax, and the 2nd derivative is 2A.
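For the record, the algebra: the 2nd difference is the difference between two consecutive 1st differences, with s = stride:

2nd_diff = poly(i+2s) - 2*poly(i+s) + poly(i)
         = A*( (i+2s)^2 - 2*(i+s)^2 + i^2 )  +  B*( (i+2s) - 2*(i+s) + i )  +  C*(1 - 2 + 1)
         = A*( i^2 + 4*s*i + 4*s^2  -  2*i^2 - 4*s*i - 2*s^2  +  i^2 )
         = 2 * A * s^2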
// Tested and correct for a few stride and coefficient values.
#include <stdalign.h>
#include <stddef.h>   // ptrdiff_t
#include <stdlib.h>

#define LEN 1024
alignas(64) double data[LEN];

//static const double A = 1, B = 0, C = 0;  // for easy testing
static const double A = 5, B = 3, C = 7;    // can be function args

void compute2(double * const __restrict__ data)
{
    const int stride = 16;      // unroll factor.  1 reduces to the original
    const double diff2 = (stride * stride) * 2 * A;  // 2nd-order differences
    double Z[stride], Y[stride];
    for (int j = 0 ; j<stride ; j++){   // this loop will fully unroll
        Y[j] = j*j*A + j*B + C;         // poly(j) starting values to increment
        //Z[j] = (j+stride)*(j+stride)*A + (j+stride)*B + C - Y[j];
        //Z[j] = 2*j*stride*A + stride*stride*A + stride*B;
        Z[j] = ((2*j + stride)*A + B)*stride;  // 1st-difference to next Y[j], from this to the next i
    }

    for(ptrdiff_t i=0; i < LEN - (stride-1); i+=stride) {
        // loops that are easy(?) for a compiler to roll up into some SIMD vectors
        for (int j=0 ; j<stride ; j++)  data[i+j] = Y[j];  // store
        for (int j=0 ; j<stride ; j++)  Y[j] += Z[j];      // add
        for (int j=0 ; j<stride ; j++)  Z[j] += diff2;     // add
    }

    // cleanup for the last few i values
    for (int j = 0 ; j < LEN % stride ; j++) {
        // let the compiler see LEN%stride to help it decide *not* to auto-vectorize this part
        //size_t i = LEN - (stride-1) + j;
        //data[i] = poly(i);
    }
}
For stride=1 (no unroll), these simplify to the original values. But with a larger stride, a compiler can keep the elements of Y[] and Z[] in a few SIMD vectors each, since each Y[j] only ever interacts with the corresponding Z[j].
That gives the compiler (SIMD) and CPU (pipelined execution units) stride independent dependency chains of parallelism to take advantage of, running stride times faster than the original, up to the point where you bottleneck on SIMD FP-add throughput instead of latency, or on store bandwidth if your buffers don't fit in L1d. (Or up to the point where the compiler faceplants and doesn't unroll and vectorize these loops nicely!)
How this compiles in practice: nicely with clang
(Godbolt compiler explorer) Clang auto-vectorizes nicely with stride=16 (4x YMM vectors of each) with clang14 -O3 -march=skylake -ffast-math.
It looks like clang has unrolled by a further 2, shortcutting Z[j] += diff2 into tmp = Z[j] + diff2; / Z[j] += 2*diff2;. That relieves pressure on the Z dependency chains, leaving only the Y[j] chains right up against the latency bottleneck on Skylake.
So each asm loop iteration does 2x 8 vaddpd instructions and 2x 4 stores. Loop overhead is an add plus the macro-fused cmp/jne, so 2 uops. (Or, for a global array, just one add/jne uop, counting a negative index up toward zero; it indexes relative to the end of the array.)
Skylake runs this at very nearly 1 store and 2x vaddpd per clock cycle. That's maxed-out throughput for both of those things. The front-end only needs to keep up with a bit over 3 uops / clock cycle, but it's been 4-wide since Core 2.
The uop cache in Sandy Bridge-family makes that a non-issue. (Unless you run into the JCC erratum on Skylake, which is why I used -mbranches-within-32B-boundaries to have clang pad instructions to avoid that.)
With Skylake's vaddpd latency of 4 cycles, 4 dependency chains from stride=16 is just barely enough to keep 4 independent operations in flight. Any time a Y[j] += doesn't run in the cycle it's ready, that creates a bubble. Thanks to clang's extra unroll of the Z[] chain, a Z[j] += could then run early, so the Z chain can get ahead. With oldest-ready-first scheduling, it tends to settle down into a state where the Y[j] += uops don't have conflicts, apparently, since it does run at full speed on my Skylake. If we could get the compiler to still make nice asm for stride=32, that would leave more room, but unfortunately it doesn't. (At the cost of more cleanup work for odd sizes.)
Clang strangely only vectorizes this with -ffast-math. A templated version in the full benchmark below doesn't need -ffast-math. The source was carefully written to be SIMD-friendly, with math operations in source order. (Fast math is what lets clang unroll the Z increments more, though.)
The other way to write the loops, with one inner loop instead of all the Y ops then all the Z ops, is fine in the benchmark below (and sometimes actually better), but here it doesn't vectorize even with -ffast-math. Getting optimal unrolled SIMD asm out of a compiler for a non-trivial problem like this can be fiddly and unreliable, and may take some playing around.
I included it inside a #if 0 / #else / #endif block on Godbolt.
// can auto-vectorize better or worse than the other way
// depending on compiler and surrounding code.
    for(int i=0; i < LEN - (stride-1); i+=stride) {
        for (int j = 0 ; j<stride ; j++){
            data[i+j] = Y[j];
            Y[j] += Z[j];
            Z[j] += diff2;
        }
    }
We have to manually choose how much to unroll. Too large an unroll factor can even stop the compiler from seeing what's going on, and it stops keeping the temporary arrays in registers. For example, 32 or 24 are a problem for clang, but 16 is not. There may be some tuning options to force the compiler to fully unroll loops up to a certain count; there are for GCC, which can sometimes be used to let it see through something at compile time.
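For example (an untested sketch, not something benchmarked here): GCC 8 and later accept #pragma GCC unroll N, and clang has #pragma clang loop unroll(full), which could be put on the inner j loops to insist on full unrolling so Y[] and Z[] have a chance of staying in registers even for larger strides. Whether that actually helps for stride=32 is not something this answer verified.

    for (ptrdiff_t i = 0; i < LEN - (stride-1); i += stride) {
        #pragma GCC unroll 32      // clang alternative: #pragma clang loop unroll(full)
        for (int j = 0; j < stride; j++)  data[i+j] = Y[j];
        #pragma GCC unroll 32
        for (int j = 0; j < stride; j++)  Y[j] += Z[j];
        #pragma GCC unroll 32
        for (int j = 0; j < stride; j++)  Z[j] += diff2;
    }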
Another approach would be manual vectorization with #include <immintrin.h> and __m256d Z[4] instead of double Z[16]. But this version is portable and can vectorize for other ISAs, like AArch64.
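Roughly what that manual vectorization could look like (AVX1 intrinsics; an untested sketch for illustration, assuming len is a multiple of 16, not the code from the benchmark):

#include <immintrin.h>
#include <stddef.h>

// 4 YMM vectors each of Y and Z (16 doubles total), advanced in lockstep.
void compute2_avx(double * __restrict__ data, size_t len, double A, double B, double C)
{
    const int stride = 16;
    double Yinit[stride], Zinit[stride];
    for (int j = 0; j < stride; j++) {
        Yinit[j] = (A*j + B)*j + C;                   // poly(j)
        Zinit[j] = ((2*j + stride)*A + B) * stride;   // poly(j+stride) - poly(j)
    }
    __m256d Y[4], Z[4];
    for (int v = 0; v < 4; v++) {
        Y[v] = _mm256_loadu_pd(&Yinit[4*v]);
        Z[v] = _mm256_loadu_pd(&Zinit[4*v]);
    }
    const __m256d diff2 = _mm256_set1_pd(2.0 * A * stride * stride);  // 2nd difference
    for (size_t i = 0; i < len; i += stride) {
        for (int v = 0; v < 4; v++) {        // trivially fully unrolled by the compiler
            _mm256_storeu_pd(&data[i + 4*v], Y[v]);
            Y[v] = _mm256_add_pd(Y[v], Z[v]);
            Z[v] = _mm256_add_pd(Z[v], diff2);
        }
    }
}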
Another downside to a large unroll factor is leaving more cleanup work when the problem size isn't a multiple of the unroll. (You might use the compute1 strategy for the cleanup, letting the compiler vectorize that for an iteration or two before going scalar.)
In theory a compiler would be allowed to do this for you with -ffast-math, either from compute1 doing the strength reduction on the original polynomial, or from compute2 seeing how the stride accumulates.
But in practice that's really complicated and something humans have to do themselves. Unless / until someone gets around to teaching compilers how to look for patterns like this and apply the method of differences itself, with a choice of stride! But a wholesale rewrite of the algorithm with different error-accumulation properties might be undesirable even with -ffast-math. (Integer wouldn't have any precision concerns, but it's still a complicated pattern-match / replacement.)
Experimental performance results:
I tested on my desktop (i7-6700k) with clang 13.0.0. This does in fact run at 1 SIMD store per clock cycle with several combinations of compiler options (fast-math or not) and #if 0 vs. #if 1 on the inner-loop strategy. My benchmark / test framework is based on @huseyin tugrul buyukisik's version, improved to repeat a more measurable amount of work between rdtsc instructions, and with a test loop to check correctness against a simple calculation of the polynomial.
I also had it compensate for the difference between the core clock frequency and the "reference" frequency of the TSC read by rdtsc, in my case 3.9 GHz vs. 4008 MHz. (The rated max turbo is 4.2 GHz, but with EPP = balance_performance on Linux, it only wants to clock up to 3.9 GHz.)
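That correction is just scaling by the ratio of the two frequencies:

core_cycles = rdtsc_ticks * (3.9 GHz core clock / 4.008 GHz TSC reference)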
Source code on Godbolt: using one inner loop rather than 3 separate j<16 loops, and not using -ffast-math. Using __attribute__((noinline)) to keep this from inlining into the repeat loop. Some other variations of options and source led to some vpermpd shuffles inside the loop.
The benchmark data below is from a previous version with a buggy Z[j] initializer, but the same loop asm. The Godbolt link now has a correctness test after the timed loop, which passes. Actual performance is still the same on my desktop, just over 0.25 cycles per double, even without #if 1 / -ffast-math to allow clang extra unrolling.
$ clang++ -std=gnu++17 -O3 -march=native -mbranches-within-32B-boundaries poly-eval.cpp -Wall
# warning about noipa, only GCC knows that attribute
$ perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,fp_arith_inst_retired.256b_packed_double -r10 ./a.out
... (10 runs of the whole program, ending with)
...
0.252295 cycles per data element (corrected from ref cycles to core clocks for i7-6700k @ 3.9 GHz)
0.252109 cycles per data element (corrected from ref cycles to core clocks for i7-6700k @ 3.9 GHz)
xor=4303
min cycles per data = 0.251868
Performance counter stats for './a.out' (10 runs):
298.92 msec task-clock # 0.989 CPUs utilized ( +- 0.49% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
129 page-faults # 427.583 /sec ( +- 0.56% )
1,162,430,637 cycles # 3.853 GHz ( +- 0.49% ) # time spent in the kernel for system calls and interrupts isn't counted, that's why it's not 3.90 GHz
3,772,516,605 instructions # 3.22 insn per cycle ( +- 0.00% )
3,683,072,459 uops_issued.any # 12.208 G/sec ( +- 0.00% )
4,824,064,881 uops_executed.thread # 15.990 G/sec ( +- 0.00% )
2,304,000,000 fp_arith_inst_retired.256b_packed_double # 7.637 G/sec
0.30210 +- 0.00152 seconds time elapsed (+- 0.50%)
fp_arith_inst_retired.256b_packed_double counts 1 for each FP add or mul instruction (2 for FMA), so we're getting 1.98 vaddpd instructions per clock cycle for the whole program, including printing and so on. That's very close to the theoretical max of 2/clock, apparently not suffering from sub-optimal uop scheduling. (I bumped up the repeat loop so the program spends most of its total time there, making whole-program perf stat useful.)
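(That 1.98 figure is just the ratio of the two counters above:)

2,304,000,000 fp_arith_inst_retired.256b_packed_double  /  1,162,430,637 cycles  ≈  1.98 vaddpd per core clock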
The goal of this optimization was to get the same work done with fewer FLOPS, but that also means we're essentially maxing out Skylake's FLOP/clock limit without using FMA. (30.58 GFLOP/s on a single core at 3.9 GHz.)
Asm of the non-inline function (objdump -drwC -Mintel): clang used 4 Y,Z pairs of YMM vectors, and unrolled the loop a further 3x to make it an exact multiple of the 24 KiB size with no cleanup. Note the add rax,0x30 advancing by 3 * stride=0x10 doubles per iteration.
0000000000001440 <void compute2<3072>(double*)>:
# just loading constants; the setup loop did fully unroll and disappear
1440: c5 fd 28 0d 18 0c 00 00 vmovapd ymm1,YMMWORD PTR [rip+0xc18] # 2060 <_IO_stdin_used+0x60>
1448: c5 fd 28 15 30 0c 00 00 vmovapd ymm2,YMMWORD PTR [rip+0xc30] # 2080
1450: c5 fd 28 1d 48 0c 00 00 vmovapd ymm3,YMMWORD PTR [rip+0xc48] # 20a0
1458: c4 e2 7d 19 25 bf 0b 00 00 vbroadcastsd ymm4,QWORD PTR [rip+0xbbf] # 2020
1461: c5 fd 28 2d 57 0c 00 00 vmovapd ymm5,YMMWORD PTR [rip+0xc57] # 20c0
1469: 48 c7 c0 d0 ff ff ff mov rax,0xffffffffffffffd0
1470: c4 e2 7d 19 05 af 0b 00 00 vbroadcastsd ymm0,QWORD PTR [rip+0xbaf] # 2028
1479: c5 fd 28 f4 vmovapd ymm6,ymm4 # buggy Z[j] initialization in this ver used the same value everywhere
147d: c5 fd 28 fc vmovapd ymm7,ymm4
1481: c5 7d 28 c4 vmovapd ymm8,ymm4
1485: 66 66 2e 0f 1f 84 00 00 00 00 00 data16 cs nop WORD PTR [rax+rax*1+0x0]
# top of outer loop. The NOP before this is to align it.
1490: c5 fd 11 ac c7 80 01 00 00 vmovupd YMMWORD PTR [rdi+rax*8+0x180],ymm5
1499: c5 d5 58 ec vaddpd ymm5,ymm5,ymm4
149d: c5 dd 58 e0 vaddpd ymm4,ymm4,ymm0
14a1: c5 fd 11 9c c7 a0 01 00 00 vmovupd YMMWORD PTR [rdi+rax*8+0x1a0],ymm3
14aa: c5 e5 58 de vaddpd ymm3,ymm3,ymm6
14ae: c5 cd 58 f0 vaddpd ymm6,ymm6,ymm0
14b2: c5 fd 11 94 c7 c0 01 00 00 vmovupd YMMWORD PTR [rdi+rax*8+0x1c0],ymm2
14bb: c5 ed 58 d7 vaddpd ymm2,ymm2,ymm7
14bf: c5 c5 58 f8 vaddpd ymm7,ymm7,ymm0
14c3: c5 fd 11 8c c7 e0 01 00 00 vmovupd YMMWORD PTR [rdi+rax*8+0x1e0],ymm1
14cc: c5 bd 58 c9 vaddpd ymm1,ymm8,ymm1
14d0: c5 3d 58 c0 vaddpd ymm8,ymm8,ymm0
14d4: c5 fd 11 ac c7 00 02 00 00 vmovupd YMMWORD PTR [rdi+rax*8+0x200],ymm5
14dd: c5 d5 58 ec vaddpd ymm5,ymm5,ymm4
14e1: c5 dd 58 e0 vaddpd ymm4,ymm4,ymm0
14e5: c5 fd 11 9c c7 20 02 00 00 vmovupd YMMWORD PTR [rdi+rax*8+0x220],ymm3
14ee: c5 e5 58 de vaddpd ymm3,ymm3,ymm6
14f2: c5 cd 58 f0 vaddpd ymm6,ymm6,ymm0
14f6: c5 fd 11 94 c7 40 02 00 00 vmovupd YMMWORD PTR [rdi+rax*8+0x240],ymm2
14ff: c5 ed 58 d7 vaddpd ymm2,ymm2,ymm7
1503: c5 c5 58 f8 vaddpd ymm7,ymm7,ymm0
1507: c5 fd 11 8c c7 60 02 00 00 vmovupd YMMWORD PTR [rdi+rax*8+0x260],ymm1
1510: c5 bd 58 c9 vaddpd ymm1,ymm8,ymm1
1514: c5 3d 58 c0 vaddpd ymm8,ymm8,ymm0
1518: c5 fd 11 ac c7 80 02 00 00 vmovupd YMMWORD PTR [rdi+rax*8+0x280],ymm5
1521: c5 d5 58 ec vaddpd ymm5,ymm5,ymm4
1525: c5 dd 58 e0 vaddpd ymm4,ymm4,ymm0
1529: c5 fd 11 9c c7 a0 02 00 00 vmovupd YMMWORD PTR [rdi+rax*8+0x2a0],ymm3
1532: c5 e5 58 de vaddpd ymm3,ymm3,ymm6
1536: c5 cd 58 f0 vaddpd ymm6,ymm6,ymm0
153a: c5 fd 11 94 c7 c0 02 00 00 vmovupd YMMWORD PTR [rdi+rax*8+0x2c0],ymm2
1543: c5 ed 58 d7 vaddpd ymm2,ymm2,ymm7
1547: c5 c5 58 f8 vaddpd ymm7,ymm7,ymm0
154b: c5 fd 11 8c c7 e0 02 00 00 vmovupd YMMWORD PTR [rdi+rax*8+0x2e0],ymm1
1554: c5 bd 58 c9 vaddpd ymm1,ymm8,ymm1
1558: c5 3d 58 c0 vaddpd ymm8,ymm8,ymm0
155c: 48 83 c0 30 add rax,0x30
1560: 48 3d c1 0b 00 00 cmp rax,0xbc1
1566: 0f 82 24 ff ff ff jb 1490 <void compute2<3072>(double*)+0x50>
156c: c5 f8 77 vzeroupper
156f: c3 ret
Related:
Latency bounds and throughput bounds for processors for operations that must occur in sequence - analysis of code with two dependency chains, one reading from the other and earlier in itself. Same dependency pattern as the strength-reduced loop, except one of its chains is an FP multiply. (It's also a polynomial evaluation scheme, but for one large polynomial.)
SIMD optimization of a curve computed from the second derivative - another case of being able to stride along the serial dependency.
Is it possible to use SIMD on a serial dependency in a calculation, like an exponential moving average filter? - If there's a closed-form formula for n steps ahead, you can use that to sidestep serial dependencies.
Out of Order Execution, How to Solve True Dependency? - CPUs have to wait when an instruction depends on one that hasn't executed yet.
Dependency chain analysis, from one of Agner Fog's examples.
Modern Microprocessors: A 90-Minute Guide! - general background on out-of-order exec and pipelines. Modern CPU-style short-vector SIMD exists in this form to get more work through the pipeline of a single CPU without widening the pipeline. By contrast, GPUs have many simple pipelines.
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) - Some experimental numbers with unrolling to hide the latency of FP dependency chains, and some CPU-architecture background on register renaming.