在Intel sandybridge系列cpu中对管道的程序进行反优化

重要背景阅读:Agner Fog的microarch pdf，可能还有Ulrich Drepper的《每个程序员都应该知道内存》。参见x86标签wiki中的其他链接，特别是Intel的优化手册，以及David Kanter对Haswell微架构的分析，并附有图表。

非常酷的任务;比我看到的那些要求学生为gcc -O0优化一些代码，学习一堆在实际代码中无关紧要的技巧的方法要好得多。在这种情况下，要求您了解CPU管道，并使用它来指导反优化工作，而不仅仅是盲目的猜测。这本书最有趣的部分是用“恶魔般的无能”来证明每一种悲观情绪，而不是故意的恶意。

作业的措辞和代码问题:

这段代码的特定于uarch的选项是有限的。它不使用任何数组，大部分开销是调用exp/log库函数。没有一种明显的方法来获得或多或少的指令级并行性，并且循环携带的依赖链非常短。

仅仅通过重新排列表达式来改变依赖关系，以减少ILP带来的危险，很难获得减速。

英特尔sandybridge系列cpu是一种激进的乱序设计，它花费了大量的晶体管和功率来寻找并行性，并避免会困扰经典RISC有序管道的危险(依赖)。通常唯一的传统风险是RAW“真正的”依赖关系，导致吞吐量受到延迟的限制。

由于寄存器重命名，寄存器的WAR和WAW危险几乎不是问题。(popcnt/lzcnt/tzcnt除外，它们的目标依赖于Intel cpu，即使它应该只写)。

对于内存排序，现代cpu使用存储缓冲区来延迟提交到缓存，直到退役，也避免了WAR和WAW的危险。另请参阅关于什么是存储缓冲区的回答，以及OoO exec将执行与其他核心可以看到的东西分离的必要条件。

为什么mulss在Haswell上只需要3个周期，不同于Agner的指令表?(使用多个累加器展开FP循环)有更多关于寄存器重命名和在FP点积循环中隐藏FMA延迟的内容。

The "i7" brand-name was introduced with Nehalem (successor to Core2), and some Intel manuals even say Core i7 when they seem to mean Nehalem, but they kept the "i7" branding for Sandybridge and later microarchitectures. SnB is when the P6-family evolved into a new species, the SnB-family. In many ways, Nehalem has more in common with Pentium III than with Sandybridge (e.g. register read stalls aka ROB-read stalls don't happen on SnB, because it changed to using a physical register file. Also a uop cache and a different internal uop format). The term "i7 architecture" is not useful, because it makes little sense to group the SnB-family with Nehalem but not Core2. (Nehalem did introduce the shared inclusive L3 cache architecture for connecting multiple cores together, though. And also integrated GPUs. So chip-level, the naming makes more sense.)

恶魔般的无能可以证明的好想法的总结

即使是极其不称职的人也不太可能添加明显无用的工作或无限循环，而把c++ /Boost类搞得一团糟也超出了任务的范围。

Multi-thread with a single shared std::atomic<uint64_t> loop counter, so the right total number of iterations happen. Atomic uint64_t is especially bad with -m32 -march=i586. For bonus points, arrange for it to be misaligned, and crossing a page boundary with an uneven split (not 4:4). False sharing for some other non-atomic variable -> memory-order mis-speculation pipeline clears, as well as extra cache misses. Instead of using - on FP variables, XOR the high byte with 0x80 to flip the sign bit, causing store-forwarding stalls. Time each iteration independently, with something even heavier than RDTSC. e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly. Change multiplies by constants to divides by their reciprocal ("for ease of reading"). div is slow and not fully pipelined. Vectorize the multiply/sqrt with AVX (SIMD), but fail to use vzeroupper before calls to scalar math-library exp() and log() functions, causing AVX<->SSE transition stalls. Store the RNG output in a linked list, or in arrays which you traverse out of order. Same for the result of each iteration, and sum at the end.

在这个答案中还包括，但在总结中不包括:在非流水线CPU上同样缓慢的建议，或者即使是恶魔般的无能也似乎不合理。例如，许多蹩脚的编译器想法会产生明显不同/更糟糕的asm。

多线程严重

也许可以使用OpenMP以很少的迭代实现多线程循环，开销远远大于速度增益。你的蒙特卡罗代码有足够的并行度来实现加速，特别是如果我们成功地使每次迭代变慢的话。(每个线程计算部分payoff_sum，添加在最后)。#omp并行可能是一个优化，而不是一个悲观。

Multi-thread but force both threads to share the same loop counter (with atomic increments so the total number of iterations is correct). This seems diabolically logical. This means using a static variable as a loop counter. This justifies use of atomic for loop counters, and creates actual cache-line ping-ponging (as long as the threads don't run on the same physical core with hyperthreading; that might not be as slow). Anyway, this is much slower than the un-contended case for lock xadd or lock dec. And lock cmpxchg8b to atomically increment a contended uint64_t on a 32bit system will have to retry in a loop instead of having the hardware arbitrate an atomic inc.

Also create false sharing, where multiple threads keep their private data (e.g. RNG state) in different bytes of the same cache line. (Intel tutorial about it, including perf counters to look at). There's a microarchitecture-specific aspect to this: Intel CPUs speculate on memory mis-ordering not happening, and there's a memory-order machine-clear perf event to detect this, at least on P4. The penalty might not be as large on Haswell. As that link points out, a locked instruction assumes this will happen, avoiding mis-speculation. A normal load speculates that other cores won't invalidate a cache line between when the load executes and when it retires in program-order (unless you use pause). True sharing without locked instructions is usually a bug. It would be interesting to compare a non-atomic shared loop counter with the atomic case. To really pessimize, keep the shared atomic loop counter, and cause false sharing in the same or a different cache line for some other variable.

随机uarch特定的想法:

如果您可以引入任何不可预知的分支，那将大大降低代码的效率。现代x86 cpu有相当长的管道，所以一个错误的预测花费大约15个周期(当从uop缓存运行时)。

依赖链:

我认为这是作业的一部分。

通过选择具有一个较长的依赖链而不是多个较短的依赖链的操作顺序来克服CPU利用指令级并行的能力。除非使用-ffast-math，否则编译器不允许更改FP计算的操作顺序，因为这会改变结果(如下所述)。

To really make this effective, increase the length of a loop-carried dependency chain. Nothing leaps out as obvious, though: The loops as written have very short loop-carried dependency chains: just an FP add. (3 cycles). Multiple iterations can have their calculations in-flight at once, because they can start well before the payoff_sum += at the end of the previous iteration. (log() and exp take many instructions, but not a lot more than Haswell's out-of-order window for finding parallelism: ROB size=192 fused-domain uops, and scheduler size=60 unfused-domain uops. As soon as execution of the current iteration progresses far enough to make room for instructions from the next iteration to issue, any parts of it that have their inputs ready (i.e. independent/separate dep chain) can start executing when older instructions leave the execution units free (e.g. because they're bottlenecked on latency, not throughput.).

RNG状态几乎肯定是一个比addps更长的循环携带依赖链。

使用更慢/更多的FP操作(特别是更多的除法):

除以2.0而不是乘以0.5，以此类推。FP乘法在英特尔的设计中被大量流水线化，并且在Haswell和以后的产品中每0.5c的吞吐量有一个。FP divsd/divpd只是部分流水线。(尽管Skylake在divpd xmm上有一个令人印象深刻的每4c的吞吐量，有13-14c的延迟，而在Nehalem上根本没有流水线(7-22c))。

do{…;Euclid_sq = x*x + y*y;} while (euclid_sq >= 1.0);显然是在测试一个距离，所以显然应该对它进行sqrt()。:P (sqrt甚至比div还要慢)。

正如@Paul Clayton所建议的，用关联/分布等价重写表达式可以引入更多的工作(只要你不使用- fast-math来允许编译器重新优化)。(exp (T * (r - 0.5 * v *))可能成为exp (T * r - T * v * v / 2.0)。注意，虽然实数数学是结合的，但浮点数学不是，即使不考虑overflow/NaN(这就是- fast-math默认不开启的原因)。请参阅Paul的评论，了解非常复杂的pow()嵌套建议。

如果你可以将计算缩小到非常小的数字，那么FP数学运算需要大约120个额外的周期来捕获到微码，当对两个正常数的运算产生一个非正常数时。具体数字和细节请参见Agner Fog的microarch pdf。这是不可能的，因为你有很多乘数，所以比例因子会是平方，一直下溢到0.0。我看不出有任何方法可以证明这种必要的扩张是无能的(甚至是邪恶的)，只有故意的恶意。

###如果你可以使用intrinsic (< imminrin .h>)

使用movnti从缓存中删除数据。恶魔:它是新的，弱顺序的，所以应该让CPU运行得更快，对吗?或者看看这个链接的问题，在这种情况下，某人正处于这样做的危险中(对于只有一些位置是热点的分散写入)。Clflush可能没有恶意。

在FP数学运算之间使用整数洗牌来引起旁路延迟。

Mixing SSE and AVX instructions without proper use of vzeroupper causes large stalls in pre-Skylake (and a different penalty in Skylake). Even without that, vectorizing badly can be worse than scalar (more cycles spent shuffling data into/out of vectors than saved by doing the add/sub/mul/div/sqrt operations for 4 Monte-Carlo iterations at once, with 256b vectors). add/sub/mul execution units are fully pipelined and full-width, but div and sqrt on 256b vectors aren't as fast as on 128b vectors (or scalars), so the speedup isn't dramatic for double.

exp() and log() don't have hardware support, so that part would require extracting vector elements back to scalar and calling the library function separately, then shuffling the results back into a vector. libm is typically compiled to only use SSE2, so will use the legacy-SSE encodings of scalar math instructions. If your code uses 256b vectors and calls exp without doing a vzeroupper first, then you stall. After returning, an AVX-128 instruction like vmovsd to set up the next vector element as an arg for exp will also stall. And then exp() will stall again when it runs an SSE instruction. This is exactly what happened in this question, causing a 10x slowdown. (Thanks @ZBoson).

请参见Nathan Kurz使用Intel的数学lib与glibc的实验。未来的glibc将提供exp()的向量化实现，等等。

如果目标是pre-IvB，或Nehalem，尝试让gcc用16位或8位操作导致部分寄存器停顿，然后是32位或64位操作。在大多数情况下，gcc将在8位或16位操作后使用movzx，但在这里，gcc修改ah，然后读取ax

使用(内联)asm:

With (inline) asm, you could break the uop cache: A 32B chunk of code that doesn't fit in three 6uop cache lines forces a switch from the uop cache to the decoders. An incompetent ALIGN (like NASM's default) using many single-byte nops instead of a couple long nops on a branch target inside the inner loop might do the trick. Or put the alignment padding after the label, instead of before. :P This only matters if the frontend is a bottleneck, which it won't be if we succeeded at pessimizing the rest of the code.

使用自修改代码来触发管道清除(又名machine-nukes)。

LCP从16位指令开始，立即太大而不能容纳8位指令是不太可能有用的。SnB和以后的uop缓存意味着你只需要支付解码惩罚一次。在Nehalem(第一个i7)上，它可能适用于一个不适合28 uop循环缓冲区的循环。GCC有时会生成这样的指令，即使使用-mtune=intel，并且可以使用32位指令。

计时的一个常用习语是CPUID(序列化)然后RDTSC。使用CPUID/RDTSC分别计算每个迭代的时间，以确保RDTSC不会与前面的指令重新排序，这会大大降低速度。(在现实生活中，计算时间的聪明方法是一起计算所有迭代的时间，而不是分别计算每个迭代的时间并将它们相加)。

导致大量缓存丢失和其他内存减慢

Use a union { double d; char a[8]; } for some of your variables. Cause a store-forwarding stall by doing a narrow store (or Read-Modify-Write) to just one of the bytes. (That wiki article also covers a lot of other microarchitectural stuff for load/store queues). e.g. flip the sign of a double using XOR 0x80 on just the high byte, instead of a - operator. The diabolically incompetent developer may have heard that FP is slower than integer, and thus try to do as much as possible using integer ops. (A compiler could theoretically still compile this to an xorps with a constant like -, but for x87 the compiler would have to realize that it's negating the value and fchs or replace the next add with a subtract.)

如果使用-O3编译而不使用std::atomic，则使用volatile强制编译器实际存储/重新加载所有位置。全局变量(而不是局部变量)也会强制一些存储/重新加载，但c++内存模型的弱排序并不需要编译器一直溢出/重新加载到内存中。

用大结构体的成员替换局部变量，这样就可以控制内存布局。

在结构体中使用数组填充(并存储随机数，以证明它们的存在)。

选择您的内存布局，使所有内容都进入L1缓存中同一“集”的不同行。它只有8向结合律，也就是说，每个集合有8种“方式”。缓存线是64B。

Even better, put things exactly 4096B apart, since loads have a false dependency on stores to different pages but with the same offset within a page. Aggressive out-of-order CPUs use Memory Disambiguation to figure out when loads and stores can be reordered without changing the results, and Intel's implementation has false-positives that prevent loads from starting early. Probably they only check bits below the page offset so it can start before the TLB has translated the high bits from a virtual page to a physical page. As well as Agner's guide, see this answer, and a section near the end of @Krazy Glew's answer on the same question. (Andy Glew was an architect of Intel's PPro - P6 microarchitecture.) (Also related: https://stackoverflow.com/a/53330296 and https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake)

使用__attribute__((wrapped))来让你错误地对齐变量，使它们跨越缓存线甚至页面边界。(因此，一个double的加载需要来自两条缓存线的数据)。在任何Intel i7 uarch中，不对齐的加载都没有惩罚，除非跨越缓存线和页线。缓存线分割仍然需要额外的周期。Skylake极大地降低了页面分割加载的代价，从100个循环减少到5个循环。(2.1.3节)。(并且可以并行地进行两页遍历)。

A page-split on an atomic<uint64_t> should be just about the worst case, esp. if it's 5 bytes in one page and 3 bytes in the other page, or anything other than 4:4. Even splits down the middle are more efficient for cache-line splits with 16B vectors on some uarches, IIRC. Put everything in a alignas(4096) struct __attribute((packed)) (to save space, of course), including an array for storage for the RNG results. Achieve the misalignment by using uint8_t or uint16_t for something before the counter.

如果你能让编译器使用索引寻址模式，这将击败uop微融合。也许可以使用#define将简单的标量变量替换为my_data[constant]。

如果你可以引入一个额外的间接级别，所以加载/存储地址不能提前知道，这可能会进一步恶化。

以不连续的顺序遍历数组

我认为我们可以首先提出引入数组的不恰当的理由:它让我们将随机数的生成与随机数的使用分开。每次迭代的结果也可以存储在一个数组中，以供以后求和(这是更糟糕的无能)。

For "maximum randomness", we could have a thread looping over the random array writing new random numbers into it. The thread consuming the random numbers could generate a random index to load a random number from. (There's some make-work here, but microarchitecturally it helps for load-addresses to be known early so any possible load latency can be resolved before the loaded data is needed.) Having a reader and writer on different cores will cause memory-ordering mis-speculation pipeline clears (as discussed earlier for the false-sharing case).

为了达到最大的悲观效果，循环数组的步幅为4096字节(即512个双精度)。如。

for (int i=0 ; i<512; i++)
    for (int j=i ; j<UPPER_BOUND ; j+=512)
        monte_carlo_step(rng_array[j]);

因此，访问模式是0,4096,8192，…， 8,4104, 8200，…… 16、4112、8208……

这是你访问一个2D数组，如double rng_array[MAX_ROWS][512]在错误的顺序(循环在行，而不是列在一个内循环，由@JesperJuhl建议)。如果恶魔般的无能可以证明2D数组具有这样的维度，那么普通的现实世界的无能很容易证明使用错误的访问模式的循环是正确的。这发生在现实生活中的真实代码中。

如果数组不是那么大，如果需要使用许多不同的页面，而不是重用相同的几个页面，可以调整循环边界。硬件预取不能跨页面(也不能/根本不能)工作。预取器可以在每个页面中跟踪一个正向流和一个反向流(就像这里发生的那样)，但只有在内存带宽还没有被非预取饱和时才会对它起作用。

这也会产生很多TLB错误，除非页面被合并成一个巨大的页面(Linux为使用mmap(MAP_ANONYMOUS)的malloc/new等匿名(不支持文件的)分配机会性地这样做)。

Instead of an array to store the list of results, you could use a linked list. Every iteration would require a pointer-chasing load (a RAW true dependency hazard for the load-address of the next load). With a bad allocator, you might manage to scatter the list nodes around in memory, defeating cache. With a bad toy allocator, it could put every node at the beginning of its own page. (e.g. allocate with mmap(MAP_ANONYMOUS) directly, without breaking up pages or tracking object sizes to properly support free).

这些并不是特定于微体系结构的，也与管道没有什么关系(其中大多数也会在非管道的CPU上减慢)。

有点跑题:让编译器生成更糟糕的代码/做更多的工作:

对于最糟糕的代码，使用c++ 11 std::atomic<int>和std::atomic<double>。即使没有来自另一个线程的争用，MFENCEs和锁定的指令也非常慢。

-m32会使代码变慢，因为x87代码会比SSE2代码更糟糕。基于堆栈的32位调用约定需要更多指令，甚至将堆栈上的FP参数传递给像exp()这样的函数。atomic<uint64_t>::operator++ on -m32需要一个锁cmpxchg8B循环(i586)。(所以使用循环计数器!(邪恶的笑))。

-march=i386也会悲观(谢谢@Jesper)。FP与fcom相比比686 fcomi慢。Pre-586不提供原子的64位存储(更不用说cmpxchg)，因此所有64位原子操作都编译为libgcc函数调用(这可能是为i686编译的，而不是实际使用锁)。在最后一段的Godbolt Compiler Explorer链接中尝试一下。

使用long double / sqrtl / expl在sizeof(long double)为10或16的abi中获得额外的精度和额外的慢度(带有用于对齐的填充)。(IIRC, 64bit Windows使用8字节长的double等价于double。(无论如何，10byte (80bit) FP操作数的加载/存储是4/ 7 uop，而对于fld m64/m32/fst, float或double只需要1 uop)。强制x87与长双失败自动向量化甚至gcc -m64 -march=haswell -O3。

如果不使用atomic<uint64_t>循环计数器，则所有内容都使用long double，包括循环计数器。

原子<double>编译，但是不支持像+=这样的读-修改-写操作(即使在64位上)。Atomic <long double>必须调用仅用于原子加载/存储的库函数。这可能真的很低效，因为x86 ISA并不自然地支持原子的10字节加载/存储，而且我能想到的没有锁定的唯一方法(cmpxchg16b)需要64位模式。

在-O0时，通过将部分分配给临时变量来分解一个大表达式将导致更多的存储/重新加载。如果没有volatile之类的东西，这与实际代码的实际构建所使用的优化设置无关。

C语言的混叠规则允许用char来混叠任何东西，因此通过char*来存储会迫使编译器存储/重新加载字节存储之前/之后的所有东西，甚至在-O3。(这是一个自动向量化代码的问题，操作uint8_t数组，例如。)

尝试uint16_t循环计数器，强制截断到16位，可能通过使用16位操作数-大小(潜在的摊位)和/或额外的movzx指令(安全)。有符号溢出是未定义的行为，所以除非你使用-fwrapv或至少-fno-strict-overflow，否则有符号循环计数器不必每次迭代都重新进行签名扩展，即使是用作64位指针的偏移量。

强制从整数转换为浮点数，然后再转换回来。和/或双<=>浮点转换。指令有延迟> 1，标量int->float (cvtsi2ss)设计得很糟糕，不会将xmm寄存器的其余部分归零。(出于这个原因，gcc插入了一个额外的pxor来打破依赖关系。)

Frequently set your CPU affinity to a different CPU (suggested by @Egwor). diabolical reasoning: You don't want one core to get overheated from running your thread for a long time, do you? Maybe swapping to another core will let that core turbo to a higher clock speed. (In reality: they're so thermally close to each other that this is highly unlikely except in a multi-socket system). Now just get the tuning wrong and do it way too often. Besides the time spent in the OS saving/restoring thread state, the new core has cold L2/L1 caches, uop cache, and branch predictors.

频繁引入不必要的系统调用会降低您的速度，无论它们是什么。尽管一些重要但简单的功能(如gettimeofday)可以在用户空间with中实现，而无需过渡到内核模式。(Linux上的glibc在内核的帮助下做到了这一点:内核在VDSO中导出代码+数据)。

关于系统调用开销的更多信息(包括返回用户空间后的缓存/TLB丢失，而不仅仅是上下文切换本身)，FlexSC论文对当前情况进行了一些很好的性能计数器分析，以及一个从大规模多线程服务器进程批量处理系统调用的建议。

2016-05-21 11:17:14

重要背景阅读:Agner Fog的microarch pdf，可能还有Ulrich Drepper的《每个程序员都应该知道内存》。参见x86标签wiki中的其他链接，特别是Intel的优化手册，以及David Kanter对Haswell微架构的分析，并附有图表。

非常酷的任务;比我看到的那些要求学生为gcc -O0优化一些代码，学习一堆在实际代码中无关紧要的技巧的方法要好得多。在这种情况下，要求您了解CPU管道，并使用它来指导反优化工作，而不仅仅是盲目的猜测。这本书最有趣的部分是用“恶魔般的无能”来证明每一种悲观情绪，而不是故意的恶意。

作业的措辞和代码问题:

这段代码的特定于uarch的选项是有限的。它不使用任何数组，大部分开销是调用exp/log库函数。没有一种明显的方法来获得或多或少的指令级并行性，并且循环携带的依赖链非常短。

仅仅通过重新排列表达式来改变依赖关系，以减少ILP带来的危险，很难获得减速。

英特尔sandybridge系列cpu是一种激进的乱序设计，它花费了大量的晶体管和功率来寻找并行性，并避免会困扰经典RISC有序管道的危险(依赖)。通常唯一的传统风险是RAW“真正的”依赖关系，导致吞吐量受到延迟的限制。

由于寄存器重命名，寄存器的WAR和WAW危险几乎不是问题。(popcnt/lzcnt/tzcnt除外，它们的目标依赖于Intel cpu，即使它应该只写)。

对于内存排序，现代cpu使用存储缓冲区来延迟提交到缓存，直到退役，也避免了WAR和WAW的危险。另请参阅关于什么是存储缓冲区的回答，以及OoO exec将执行与其他核心可以看到的东西分离的必要条件。

为什么mulss在Haswell上只需要3个周期，不同于Agner的指令表?(使用多个累加器展开FP循环)有更多关于寄存器重命名和在FP点积循环中隐藏FMA延迟的内容。

The "i7" brand-name was introduced with Nehalem (successor to Core2), and some Intel manuals even say Core i7 when they seem to mean Nehalem, but they kept the "i7" branding for Sandybridge and later microarchitectures. SnB is when the P6-family evolved into a new species, the SnB-family. In many ways, Nehalem has more in common with Pentium III than with Sandybridge (e.g. register read stalls aka ROB-read stalls don't happen on SnB, because it changed to using a physical register file. Also a uop cache and a different internal uop format). The term "i7 architecture" is not useful, because it makes little sense to group the SnB-family with Nehalem but not Core2. (Nehalem did introduce the shared inclusive L3 cache architecture for connecting multiple cores together, though. And also integrated GPUs. So chip-level, the naming makes more sense.)

恶魔般的无能可以证明的好想法的总结

即使是极其不称职的人也不太可能添加明显无用的工作或无限循环，而把c++ /Boost类搞得一团糟也超出了任务的范围。

Multi-thread with a single shared std::atomic<uint64_t> loop counter, so the right total number of iterations happen. Atomic uint64_t is especially bad with -m32 -march=i586. For bonus points, arrange for it to be misaligned, and crossing a page boundary with an uneven split (not 4:4). False sharing for some other non-atomic variable -> memory-order mis-speculation pipeline clears, as well as extra cache misses. Instead of using - on FP variables, XOR the high byte with 0x80 to flip the sign bit, causing store-forwarding stalls. Time each iteration independently, with something even heavier than RDTSC. e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly. Change multiplies by constants to divides by their reciprocal ("for ease of reading"). div is slow and not fully pipelined. Vectorize the multiply/sqrt with AVX (SIMD), but fail to use vzeroupper before calls to scalar math-library exp() and log() functions, causing AVX<->SSE transition stalls. Store the RNG output in a linked list, or in arrays which you traverse out of order. Same for the result of each iteration, and sum at the end.

在这个答案中还包括，但在总结中不包括:在非流水线CPU上同样缓慢的建议，或者即使是恶魔般的无能也似乎不合理。例如，许多蹩脚的编译器想法会产生明显不同/更糟糕的asm。

多线程严重

也许可以使用OpenMP以很少的迭代实现多线程循环，开销远远大于速度增益。你的蒙特卡罗代码有足够的并行度来实现加速，特别是如果我们成功地使每次迭代变慢的话。(每个线程计算部分payoff_sum，添加在最后)。#omp并行可能是一个优化，而不是一个悲观。

Multi-thread but force both threads to share the same loop counter (with atomic increments so the total number of iterations is correct). This seems diabolically logical. This means using a static variable as a loop counter. This justifies use of atomic for loop counters, and creates actual cache-line ping-ponging (as long as the threads don't run on the same physical core with hyperthreading; that might not be as slow). Anyway, this is much slower than the un-contended case for lock xadd or lock dec. And lock cmpxchg8b to atomically increment a contended uint64_t on a 32bit system will have to retry in a loop instead of having the hardware arbitrate an atomic inc.

Also create false sharing, where multiple threads keep their private data (e.g. RNG state) in different bytes of the same cache line. (Intel tutorial about it, including perf counters to look at). There's a microarchitecture-specific aspect to this: Intel CPUs speculate on memory mis-ordering not happening, and there's a memory-order machine-clear perf event to detect this, at least on P4. The penalty might not be as large on Haswell. As that link points out, a locked instruction assumes this will happen, avoiding mis-speculation. A normal load speculates that other cores won't invalidate a cache line between when the load executes and when it retires in program-order (unless you use pause). True sharing without locked instructions is usually a bug. It would be interesting to compare a non-atomic shared loop counter with the atomic case. To really pessimize, keep the shared atomic loop counter, and cause false sharing in the same or a different cache line for some other variable.

随机uarch特定的想法:

如果您可以引入任何不可预知的分支，那将大大降低代码的效率。现代x86 cpu有相当长的管道，所以一个错误的预测花费大约15个周期(当从uop缓存运行时)。

依赖链:

我认为这是作业的一部分。

通过选择具有一个较长的依赖链而不是多个较短的依赖链的操作顺序来克服CPU利用指令级并行的能力。除非使用-ffast-math，否则编译器不允许更改FP计算的操作顺序，因为这会改变结果(如下所述)。

To really make this effective, increase the length of a loop-carried dependency chain. Nothing leaps out as obvious, though: The loops as written have very short loop-carried dependency chains: just an FP add. (3 cycles). Multiple iterations can have their calculations in-flight at once, because they can start well before the payoff_sum += at the end of the previous iteration. (log() and exp take many instructions, but not a lot more than Haswell's out-of-order window for finding parallelism: ROB size=192 fused-domain uops, and scheduler size=60 unfused-domain uops. As soon as execution of the current iteration progresses far enough to make room for instructions from the next iteration to issue, any parts of it that have their inputs ready (i.e. independent/separate dep chain) can start executing when older instructions leave the execution units free (e.g. because they're bottlenecked on latency, not throughput.).

RNG状态几乎肯定是一个比addps更长的循环携带依赖链。

使用更慢/更多的FP操作(特别是更多的除法):

除以2.0而不是乘以0.5，以此类推。FP乘法在英特尔的设计中被大量流水线化，并且在Haswell和以后的产品中每0.5c的吞吐量有一个。FP divsd/divpd只是部分流水线。(尽管Skylake在divpd xmm上有一个令人印象深刻的每4c的吞吐量，有13-14c的延迟，而在Nehalem上根本没有流水线(7-22c))。

do{…;Euclid_sq = x*x + y*y;} while (euclid_sq >= 1.0);显然是在测试一个距离，所以显然应该对它进行sqrt()。:P (sqrt甚至比div还要慢)。

正如@Paul Clayton所建议的，用关联/分布等价重写表达式可以引入更多的工作(只要你不使用- fast-math来允许编译器重新优化)。(exp (T * (r - 0.5 * v *))可能成为exp (T * r - T * v * v / 2.0)。注意，虽然实数数学是结合的，但浮点数学不是，即使不考虑overflow/NaN(这就是- fast-math默认不开启的原因)。请参阅Paul的评论，了解非常复杂的pow()嵌套建议。

如果你可以将计算缩小到非常小的数字，那么FP数学运算需要大约120个额外的周期来捕获到微码，当对两个正常数的运算产生一个非正常数时。具体数字和细节请参见Agner Fog的microarch pdf。这是不可能的，因为你有很多乘数，所以比例因子会是平方，一直下溢到0.0。我看不出有任何方法可以证明这种必要的扩张是无能的(甚至是邪恶的)，只有故意的恶意。

###如果你可以使用intrinsic (< imminrin .h>)

使用movnti从缓存中删除数据。恶魔:它是新的，弱顺序的，所以应该让CPU运行得更快，对吗?或者看看这个链接的问题，在这种情况下，某人正处于这样做的危险中(对于只有一些位置是热点的分散写入)。Clflush可能没有恶意。

在FP数学运算之间使用整数洗牌来引起旁路延迟。

Mixing SSE and AVX instructions without proper use of vzeroupper causes large stalls in pre-Skylake (and a different penalty in Skylake). Even without that, vectorizing badly can be worse than scalar (more cycles spent shuffling data into/out of vectors than saved by doing the add/sub/mul/div/sqrt operations for 4 Monte-Carlo iterations at once, with 256b vectors). add/sub/mul execution units are fully pipelined and full-width, but div and sqrt on 256b vectors aren't as fast as on 128b vectors (or scalars), so the speedup isn't dramatic for double.

exp() and log() don't have hardware support, so that part would require extracting vector elements back to scalar and calling the library function separately, then shuffling the results back into a vector. libm is typically compiled to only use SSE2, so will use the legacy-SSE encodings of scalar math instructions. If your code uses 256b vectors and calls exp without doing a vzeroupper first, then you stall. After returning, an AVX-128 instruction like vmovsd to set up the next vector element as an arg for exp will also stall. And then exp() will stall again when it runs an SSE instruction. This is exactly what happened in this question, causing a 10x slowdown. (Thanks @ZBoson).

请参见Nathan Kurz使用Intel的数学lib与glibc的实验。未来的glibc将提供exp()的向量化实现，等等。

如果目标是pre-IvB，或Nehalem，尝试让gcc用16位或8位操作导致部分寄存器停顿，然后是32位或64位操作。在大多数情况下，gcc将在8位或16位操作后使用movzx，但在这里，gcc修改ah，然后读取ax

使用(内联)asm:

With (inline) asm, you could break the uop cache: A 32B chunk of code that doesn't fit in three 6uop cache lines forces a switch from the uop cache to the decoders. An incompetent ALIGN (like NASM's default) using many single-byte nops instead of a couple long nops on a branch target inside the inner loop might do the trick. Or put the alignment padding after the label, instead of before. :P This only matters if the frontend is a bottleneck, which it won't be if we succeeded at pessimizing the rest of the code.

使用自修改代码来触发管道清除(又名machine-nukes)。

LCP从16位指令开始，立即太大而不能容纳8位指令是不太可能有用的。SnB和以后的uop缓存意味着你只需要支付解码惩罚一次。在Nehalem(第一个i7)上，它可能适用于一个不适合28 uop循环缓冲区的循环。GCC有时会生成这样的指令，即使使用-mtune=intel，并且可以使用32位指令。

计时的一个常用习语是CPUID(序列化)然后RDTSC。使用CPUID/RDTSC分别计算每个迭代的时间，以确保RDTSC不会与前面的指令重新排序，这会大大降低速度。(在现实生活中，计算时间的聪明方法是一起计算所有迭代的时间，而不是分别计算每个迭代的时间并将它们相加)。

导致大量缓存丢失和其他内存减慢

Use a union { double d; char a[8]; } for some of your variables. Cause a store-forwarding stall by doing a narrow store (or Read-Modify-Write) to just one of the bytes. (That wiki article also covers a lot of other microarchitectural stuff for load/store queues). e.g. flip the sign of a double using XOR 0x80 on just the high byte, instead of a - operator. The diabolically incompetent developer may have heard that FP is slower than integer, and thus try to do as much as possible using integer ops. (A compiler could theoretically still compile this to an xorps with a constant like -, but for x87 the compiler would have to realize that it's negating the value and fchs or replace the next add with a subtract.)

如果使用-O3编译而不使用std::atomic，则使用volatile强制编译器实际存储/重新加载所有位置。全局变量(而不是局部变量)也会强制一些存储/重新加载，但c++内存模型的弱排序并不需要编译器一直溢出/重新加载到内存中。

用大结构体的成员替换局部变量，这样就可以控制内存布局。

在结构体中使用数组填充(并存储随机数，以证明它们的存在)。

选择您的内存布局，使所有内容都进入L1缓存中同一“集”的不同行。它只有8向结合律，也就是说，每个集合有8种“方式”。缓存线是64B。

Even better, put things exactly 4096B apart, since loads have a false dependency on stores to different pages but with the same offset within a page. Aggressive out-of-order CPUs use Memory Disambiguation to figure out when loads and stores can be reordered without changing the results, and Intel's implementation has false-positives that prevent loads from starting early. Probably they only check bits below the page offset so it can start before the TLB has translated the high bits from a virtual page to a physical page. As well as Agner's guide, see this answer, and a section near the end of @Krazy Glew's answer on the same question. (Andy Glew was an architect of Intel's PPro - P6 microarchitecture.) (Also related: https://stackoverflow.com/a/53330296 and https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake)

使用__attribute__((wrapped))来让你错误地对齐变量，使它们跨越缓存线甚至页面边界。(因此，一个double的加载需要来自两条缓存线的数据)。在任何Intel i7 uarch中，不对齐的加载都没有惩罚，除非跨越缓存线和页线。缓存线分割仍然需要额外的周期。Skylake极大地降低了页面分割加载的代价，从100个循环减少到5个循环。(2.1.3节)。(并且可以并行地进行两页遍历)。

A page-split on an atomic<uint64_t> should be just about the worst case, esp. if it's 5 bytes in one page and 3 bytes in the other page, or anything other than 4:4. Even splits down the middle are more efficient for cache-line splits with 16B vectors on some uarches, IIRC. Put everything in a alignas(4096) struct __attribute((packed)) (to save space, of course), including an array for storage for the RNG results. Achieve the misalignment by using uint8_t or uint16_t for something before the counter.

如果你能让编译器使用索引寻址模式，这将击败uop微融合。也许可以使用#define将简单的标量变量替换为my_data[constant]。

如果你可以引入一个额外的间接级别，所以加载/存储地址不能提前知道，这可能会进一步恶化。

以不连续的顺序遍历数组

我认为我们可以首先提出引入数组的不恰当的理由:它让我们将随机数的生成与随机数的使用分开。每次迭代的结果也可以存储在一个数组中，以供以后求和(这是更糟糕的无能)。

For "maximum randomness", we could have a thread looping over the random array writing new random numbers into it. The thread consuming the random numbers could generate a random index to load a random number from. (There's some make-work here, but microarchitecturally it helps for load-addresses to be known early so any possible load latency can be resolved before the loaded data is needed.) Having a reader and writer on different cores will cause memory-ordering mis-speculation pipeline clears (as discussed earlier for the false-sharing case).

为了达到最大的悲观效果，循环数组的步幅为4096字节(即512个双精度)。如。

for (int i=0 ; i<512; i++)
    for (int j=i ; j<UPPER_BOUND ; j+=512)
        monte_carlo_step(rng_array[j]);