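I am wondering how much of Ulrich Drepper's What Every Programmer Should Know About Memory from 2007 is still valid. Also, I could not find a version newer than 1.0, or an errata.
(Also available in PDF form on Ulrich Drepper's own site: https://www.akkadia.org/drepper/cpumemory.pdf)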
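It's still generally excellent and highly recommended (by me, and I think by other performance-tuning experts). It would be cool if Ulrich (or anyone else) wrote a 2017 update, but that would be a lot of work (e.g. re-running the benchmarks). See also other x86 performance-tuning and SSE/asm (and C/C++) optimization links in the x86 tag wiki. (Ulrich's article isn't x86-specific, but most (all) of his benchmarks were run on x86 hardware.)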
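The low-level hardware details about how DRAM and caches work all still apply. DDR4 uses the same commands as described for DDR1/DDR2 (read/write burst). The DDR3/4 improvements aren't fundamental changes. AFAIK, all the arch-independent stuff still applies generally, e.g. to AArch64 / ARM32.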
See also the Latency Bound Platforms section of this answer for important details about the effect of memory/L3 latency on single-threaded bandwidth: bandwidth <= max_concurrency / latency, and this is actually the primary bottleneck for single-threaded bandwidth on a modern many-core CPU like a Xeon. But a quad-core Skylake desktop can come close to maxing out DRAM bandwidth with a single thread. That link has some very good info about NT stores vs. normal stores on x86. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? is a summary.
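To put rough numbers on bandwidth <= max_concurrency / latency, here is a minimal sketch. The function name and the inputs in the usage note are illustrative assumptions, not measurements from any particular CPU.

```cpp
#include <cstddef>

// Upper bound on single-core bandwidth: bytes in flight divided by latency.
// Bytes per nanosecond is numerically equal to GB/s (10^9 bytes per second).
// All inputs are hypothetical; plug in figures for your own hardware.
double max_bandwidth_gb_per_s(std::size_t outstanding_lines,
                              std::size_t line_bytes,
                              double latency_ns) {
    return static_cast<double>(outstanding_lines * line_bytes) / latency_ns;
}
```

For example, assuming 10 outstanding 64-byte line requests and 80 ns of load latency, the bound is 10 * 64 / 80 = 8 GB/s. Halving the latency to 40 ns doubles the bound to 16 GB/s, which is why a lower-latency desktop part can get more out of a single thread than a many-core Xeon.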
Thus Ulrich's suggestion in 6.5.8 Utilizing All Bandwidth about using remote memory on other NUMA nodes as well as your own, is counter-productive on modern hardware where memory controllers have more bandwidth than a single core can use. Well possibly you can imagine a situation where there's a net benefit to running multiple memory-hungry threads on the same NUMA node for low-latency inter-thread communication, but having them use remote memory for high bandwidth not-latency-sensitive stuff. But this is pretty obscure, normally just divide threads between NUMA nodes and have them use local memory. Per-core bandwidth is sensitive to latency because of max-concurrency limits (see below), but all the cores in one socket can usually more than saturate the memory controllers in that socket.
One major thing that's changed is that hardware prefetch is much better than on the Pentium 4 and can recognize strided access patterns up to a fairly large stride, and multiple streams at once (e.g. one forward / backward per 4k page). Intel's optimization manual describes some details of the HW prefetchers in various levels of cache for their Sandybridge-family microarchitecture. Ivybridge and later have next-page hardware prefetch, instead of waiting for a cache miss in the new page to trigger a fast-start. I assume AMD has some similar stuff in their optimization manual. Beware that Intel's manual is also full of old advice, some of which is only good for P4. The Sandybridge-specific sections are of course accurate for SnB, but e.g. un-lamination of micro-fused uops changed in HSW and the manual doesn't mention it.
The suggestion to use a separate prefetch thread (6.3.4) is totally obsolete, I think, and was only ever good on Pentium 4. P4 had hyperthreading (2 logical cores sharing one physical core), but not enough trace-cache (and/or out-of-order execution resources) to gain throughput running two full computation threads on the same core. But modern CPUs (Sandybridge-family and Ryzen) are much beefier and should either run a real thread or not use hyperthreading (leave the other logical core idle so the solo thread has the full resources instead of partitioning the ROB).
Software prefetch has always been "brittle": the right magic tuning numbers to get a speedup depend on the details of the hardware, and maybe system load. Too early and it's evicted before the demand load. Too late and it doesn't help. This blog article shows code + graphs for an interesting experiment in using SW prefetch on Haswell for prefetching the non-sequential part of a problem. See also How to properly use prefetch instructions?. NT prefetch is interesting, but even more brittle because an early eviction from L1 means you have to go all the way to L3 or DRAM, not just L2. If you need every last drop of performance, and you can tune for a specific machine, SW prefetch is worth looking at for sequential access, but it may still be a slowdown if you have enough ALU work to do while coming close to bottlenecking on memory.
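As a rough illustration of why the tuning is brittle, here is a minimal sketch of SW prefetch for a non-sequential (indexed) access pattern. It assumes GCC or Clang (`__builtin_prefetch`); the function name `sum_indirect` and the `distance` parameter are hypothetical, and the right distance has to be found by measuring on the target machine.

```cpp
#include <cstddef>
#include <vector>

// Sums data[idx[i]] for every i, prefetching the element that will be
// needed `distance` iterations later. `distance` is the brittle tuning
// knob: too small and the prefetch arrives late, too large and the line
// may be evicted before the demand load.
double sum_indirect(const std::vector<double>& data,
                    const std::vector<std::size_t>& idx,
                    std::size_t distance) {
    double sum = 0.0;
    const std::size_t n = idx.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + distance < n) {
            // GCC/Clang builtin: (address, 0 = read, locality hint 0..3).
            __builtin_prefetch(&data[idx[i + distance]], 0, 3);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

The result is identical with or without the prefetch; only the timing changes, so any chosen `distance` should be benchmarked against the plain loop.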
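Cache line size is still 64 bytes. (L1D read/write bandwidth is very high, and modern CPUs can do 2 vector loads + 1 vector store per clock if everything hits in L1D. See How can cache be that fast?.) With AVX512, line size = vector width, so you can load/store an entire cache line in one instruction. Thus every misaligned load/store crosses a cache-line boundary, instead of every other one as with 256b AVX1/AVX2, which often doesn't slow down looping over an array that wasn't in L1D.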
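The "every access vs. every other access" point about line splits can be checked with a little address arithmetic. This is a minimal sketch; the helper names are hypothetical and 64-byte lines are assumed.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineBytes = 64;  // assumed cache-line size

// True if an access of `width` bytes starting at `addr` touches two lines.
bool crosses_line(std::uintptr_t addr, std::size_t width) {
    return (addr % kLineBytes) + width > kLineBytes;
}

// Counts how many of `n` back-to-back accesses of `width` bytes,
// starting at address `base`, are split across a line boundary.
std::size_t count_line_splits(std::uintptr_t base, std::size_t width,
                              std::size_t n) {
    std::size_t splits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (crosses_line(base + i * width, width)) {
            ++splits;
        }
    }
    return splits;
}
```

With a base address misaligned by 16 bytes, `count_line_splits(16, 32, 8)` finds 4 splits out of 8 accesses for 32-byte (256-bit) vectors, while `count_line_splits(16, 64, 8)` finds 8 out of 8 for 64-byte (512-bit) vectors.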
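Skylake-X (AVX512) no longer has an inclusive L3, but I think there's still a tag directory that lets it check what's cached anywhere on chip (and if so, where) without actually broadcasting snoops to all the cores. SKX uses a mesh rather than a ring bus, with generally even worse latency than previous many-core Xeons, unfortunately.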
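C++11 std::atomic fetch_add will compile to a lock add (or lock xadd if the return value is used), but an algorithm using CAS to do something that can't be done with a locked instruction is usually not a disaster. Use C++11 std::atomic or C11 stdatomic instead of gcc's legacy __sync built-ins or the newer __atomic built-ins, unless you want to mix atomic and non-atomic access to the same location...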
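A minimal sketch of the two cases with std::atomic: a plain increment that maps onto a single locked instruction on x86, and an operation with no such instruction (atomic max is used here purely as an illustration) done with a CAS retry loop. The function names are hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Plain increment with the result unused: on x86, compilers typically
// emit `lock add` for this (or `lock xadd` if the old value is used).
void count_event(std::atomic<std::uint64_t>& counter) {
    counter.fetch_add(1, std::memory_order_relaxed);
}

// Atomic max has no single locked x86 instruction, so use a CAS loop.
// Returns the value observed before any update.
std::uint64_t fetch_max(std::atomic<std::uint64_t>& target,
                        std::uint64_t value) {
    std::uint64_t seen = target.load(std::memory_order_relaxed);
    while (seen < value &&
           !target.compare_exchange_weak(seen, value,
                                         std::memory_order_relaxed)) {
        // On failure, `seen` now holds the current value; retry if it
        // is still smaller than `value`.
    }
    return seen;
}
```

Relaxed ordering is an assumption that suits a standalone counter or statistic; code that publishes other data through these variables would need stronger ordering.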
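8.1 DWCAS (cmpxchg16b): You can coax gcc into emitting it, but if you want efficient loads of just one half of the object, you need ugly union hacks: How can I implement ABA counter with c++11 CAS?. (Don't confuse DWCAS with DCAS of 2 separate memory locations. Lock-free atomic emulation of DCAS isn't possible with DWCAS, but transactional memory (like x86 TSX) makes it possible.)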
8.2.4 transactional memory: After a couple false starts (released then disabled by a microcode update because of a rarely-triggered bug), Intel has working transactional memory in late-model Broadwell and all Skylake CPUs. The design is still what David Kanter described for Haswell. There's a lock-elision way to use it to speed up code that uses (and can fall back to) a regular lock (especially with a single lock for all elements of a container so multiple threads in the same critical section often don't collide), or to write code that knows about transactions directly.
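Update: Intel has now disabled lock elision on later CPUs (including Skylake) with a microcode update. The RTM (xbegin / xend) non-transparent part of TSX can still work if the OS allows it, but TSX in general is seriously turning into Charlie Brown's football.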
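7.5 Hugepages: anonymous transparent hugepages work well on Linux without having to manually use hugetlbfs. Make allocations >= 2MiB with 2MiB alignment (e.g. posix_memalign, or an aligned_alloc that doesn't enforce the stupid ISO C++17 requirement to fail when size % alignment != 0).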
A 2MiB-aligned anonymous allocation will use hugepages by default. Some workloads (e.g. that keep using large allocations for a while after making them) may benefit from echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag to get the kernel to defrag physical memory whenever needed, instead of falling back to 4k pages. (See the kernel docs). Use madvise(MADV_HUGEPAGE) after making large allocations (preferably still with 2MiB alignment) to more strongly encourage the kernel to stop and defrag now. defrag = always is too aggressive for most workloads and will spend more time copying pages around than it saves in TLB misses. (kcompactd could maybe be more efficient.)
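A minimal sketch of a THP-friendly allocation on Linux, assuming glibc and a 2MiB hugepage size (as on x86-64). The helper name `alloc_thp_friendly` is hypothetical, and the madvise call is only a best-effort hint whose failure is ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <sys/mman.h>

constexpr std::size_t kHugePageBytes = std::size_t{2} << 20;  // 2 MiB (assumed)

// Returns a 2MiB-aligned buffer of at least `bytes` (rounded up to a
// multiple of 2 MiB), or nullptr on failure. Release it with std::free.
void* alloc_thp_friendly(std::size_t bytes) {
    const std::size_t rounded =
        (bytes + kHugePageBytes - 1) / kHugePageBytes * kHugePageBytes;
    void* p = nullptr;
    if (posix_memalign(&p, kHugePageBytes, rounded) != 0) {
        return nullptr;
    }
#ifdef MADV_HUGEPAGE
    // Hint only: the kernel may still use 4k pages, e.g. if THP is
    // disabled or physical memory is too fragmented.
    (void)madvise(p, rounded, MADV_HUGEPAGE);
#endif
    return p;
}
```

Whether the region actually ends up backed by hugepages depends on the system's transparent_hugepage settings; AnonHugePages in /proc/self/smaps is one way to check.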
Appendix B: Oprofile: Linux perf has mostly superseded oprofile. perf list / perf stat -e event1,event2 ... has names for most of the useful ways to program HW performance counters.
perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,\
uops_executed.thread,idq_uops_not_delivered.core -r2 ./a.out
See Can x86's MOV really be "free"? Why can't I reproduce this at all? for some examples of using it.