I am wondering how much of Ulrich Drepper's What Every Programmer Should Know About Memory from 2007 is still valid. Also, I could not find a version newer than 1.0 or an errata.
(It is also available in PDF form on Ulrich Drepper's own site: https://www.akkadia.org/drepper/cpumemory.pdf)
As far as I remember, Drepper's content describes fundamental concepts about memory: how CPU caches work, what physical and virtual memory are, and how the Linux kernel deals with that zoo. There are probably outdated API references in some of the examples, but it doesn't matter; that won't affect the relevance of the fundamental concepts.
So, any book or article that describes something fundamental cannot be called outdated. "What Every Programmer Should Know About Memory" is definitely worth reading, but, well, I don't think it's for "every programmer". It's more suitable for system/embedded/kernel people.
From my quick glance-through it looks quite accurate. The one thing to notice is the section on the difference between "integrated" and "external" memory controllers. Ever since the release of the i7 line, Intel CPUs have all been integrated, and AMD has been using integrated memory controllers since the AMD64 chips were first released.
Since this article was written, not a whole lot has changed: speeds have gotten higher and memory controllers have gotten much more intelligent (the i7 will delay writes to RAM until it feels like committing the changes), but not a whole lot has changed. At least not in any way that a software developer would care about.
The guide in PDF form is at https://www.akkadia.org/drepper/cpumemory.pdf.
In general, it's still excellent and highly recommended (by me, and I think by other performance-tuning experts). It would be cool if Ulrich (or anyone else) wrote a 2017 update, but that would be a lot of work (e.g. re-running the benchmarks). See also other x86 performance-tuning and SSE/asm (and C/C++) optimization links in the x86 tag wiki. (Ulrich's article isn't x86-specific, but most (all) of his benchmarks are on x86 hardware.)
The low-level hardware details about how DRAM and caches work all still apply. DDR4 uses the same commands (read/write bursts) as described for DDR1/DDR2. The DDR3/4 improvements aren't fundamental changes. AFAIK, all the arch-independent stuff still applies generally, e.g. to AArch64 / ARM32.
See also the Latency Bound Platforms section of this answer for important details about the effect of memory/L3 latency on single-threaded bandwidth: bandwidth <= max_concurrency / latency, and this is actually the primary bottleneck for single-threaded bandwidth on a modern many-core CPU like a Xeon. But a quad-core Skylake desktop can come close to maxing out DRAM bandwidth with a single thread. That link has some very good info about NT stores vs. normal stores on x86. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? is a summary.
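To put rough numbers on that formula (these are illustrative figures of mine, not from the article): if a core can keep about 10 outstanding 64-byte line fills in flight and the miss latency is around 70 ns, a single thread tops out near 10 × 64 B / 70 ns ≈ 9 GB/s, no matter how much bandwidth the socket's memory controllers could deliver to all cores combined.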
Thus Ulrich's suggestion in 6.5.8 Utilizing All Bandwidth about using remote memory on other NUMA nodes as well as your own, is counter-productive on modern hardware where memory controllers have more bandwidth than a single core can use. Well possibly you can imagine a situation where there's a net benefit to running multiple memory-hungry threads on the same NUMA node for low-latency inter-thread communication, but having them use remote memory for high bandwidth not-latency-sensitive stuff. But this is pretty obscure, normally just divide threads between NUMA nodes and have them use local memory. Per-core bandwidth is sensitive to latency because of max-concurrency limits (see below), but all the cores in one socket can usually more than saturate the memory controllers in that socket.
(Usually) don't use software prefetch
One major thing that's changed is that hardware prefetch is much better than on the Pentium 4 and can recognize strided access patterns up to a fairly large stride, and multiple streams at once (e.g. one forward / backward per 4k page). Intel's optimization manual describes some details of the HW prefetchers in various levels of cache for their Sandybridge-family microarchitecture. Ivybridge and later have next-page hardware prefetch, instead of waiting for a cache miss in the new page to trigger a fast-start. I assume AMD has some similar stuff in their optimization manual. Beware that Intel's manual is also full of old advice, some of which is only good for P4. The Sandybridge-specific sections are of course accurate for SnB, but e.g. un-lamination of micro-fused uops changed in HSW and the manual doesn't mention it.
Current advice is to remove all SW prefetch from old code, and only consider putting it back in if profiling shows cache misses (and you're not saturating memory bandwidth). Prefetching both sides of the next step of a binary search can still help. e.g. once you decide which element to look at next, prefetch the 1/4 and 3/4 elements so they can load in parallel with loading/checking the middle.
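A minimal sketch of that binary-search idea, assuming a sorted int array and GCC/Clang's __builtin_prefetch (the function name and layout are mine, not from the article):

```cpp
#include <cstddef>

// While we load/check a[mid], prefetch the two elements that could be
// the *next* mid (the ~1/4 and ~3/4 positions), so whichever way the
// comparison goes, that load is already in flight.
ptrdiff_t binsearch_prefetch(const int* a, ptrdiff_t n, int key) {
    ptrdiff_t lo = 0, hi = n;                      // search [lo, hi)
    while (hi - lo > 1) {
        ptrdiff_t mid = lo + (hi - lo) / 2;
        __builtin_prefetch(&a[lo + (mid - lo) / 2]);   // ~1/4 point
        __builtin_prefetch(&a[mid + (hi - mid) / 2]);  // ~3/4 point
        if (a[mid] <= key)
            lo = mid;
        else
            hi = mid;
    }
    return (lo < n && a[lo] == key) ? lo : -1;
}
```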
The suggestion to use a separate prefetch thread (6.3.4) is totally obsolete, I think, and was only ever good on Pentium 4. P4 had hyperthreading (2 logical cores sharing one physical core), but not enough trace-cache (and/or out-of-order execution resources) to gain throughput running two full computation threads on the same core. But modern CPUs (Sandybridge-family and Ryzen) are much beefier and should either run a real thread or not use hyperthreading (leave the other logical core idle so the solo thread has the full resources instead of partitioning the ROB).
Software prefetch has always been "brittle": the right magic tuning numbers to get a speedup depend on the details of the hardware, and maybe system load. Too early and it's evicted before the demand load. Too late and it doesn't help. This blog article shows code + graphs for an interesting experiment in using SW prefetch on Haswell for prefetching the non-sequential part of a problem. See also How to properly use prefetch instructions?. NT prefetch is interesting, but even more brittle because an early eviction from L1 means you have to go all the way to L3 or DRAM, not just L2. If you need every last drop of performance, and you can tune for a specific machine, SW prefetch is worth looking at for sequential access, but it may still be a slowdown if you have enough ALU work to do while coming close to bottlenecking on memory.
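As a sketch of what that tuning looks like in practice (the distance constant, hint choice, and function are illustrative, not from the article):

```cpp
#include <immintrin.h>
#include <cstddef>

// PF_DIST is the "magic number": how many elements ahead to prefetch.
// The best value depends on the CPU, how much ALU work each iteration
// does, and system load; the wrong value can easily make things slower.
constexpr std::size_t PF_DIST = 64;            // a few cache lines ahead

double sum_with_prefetch(const double* a, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i % 8 == 0 && i + PF_DIST < n)     // one prefetch per 64B line
            _mm_prefetch(reinterpret_cast<const char*>(&a[i + PF_DIST]),
                         _MM_HINT_T0);
            // _MM_HINT_NTA is the even more brittle NT prefetch mentioned
            // above: if the line is evicted from L1 early, there's no
            // copy in L2 to fall back on.
        sum += a[i];
    }
    return sum;
}
```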
Cache line size is still 64 bytes. (L1D read/write bandwidth is very high, with modern CPUs doing 2 vector loads per clock + 1 vector store if it all hits in L1D. See How can cache be that fast?.) With AVX512, line size = vector width, so you can load/store an entire cache line in one instruction. Thus every misaligned load/store crosses a cache-line boundary, instead of every other one for 256b AVX1/AVX2, which often doesn't slow down looping over an array that isn't in L1D.
Unaligned load instructions have zero penalty if the address is aligned at runtime, but compilers (especially gcc) make better code when auto-vectorizing if they know about any alignment guarantees. In practice unaligned ops are generally fast, but page-splits still hurt (much less on Skylake, though; only ~11 extra cycles of latency vs. ~100, but still a throughput penalty).
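For example, a sketch of passing gcc/clang that alignment guarantee with the __builtin_assume_aligned extension (the function and the 64-byte figure are just for illustration; lying about the alignment is UB):

```cpp
#include <cstddef>

// Promise the compiler that dst and src really are 64-byte aligned, so
// the auto-vectorizer can use aligned loads/stores and skip the scalar
// prologue/epilogue it would otherwise need to reach an aligned pointer.
void scale(float* __restrict dst, const float* __restrict src,
           std::size_t n, float factor) {
    float* d = static_cast<float*>(__builtin_assume_aligned(dst, 64));
    const float* s = static_cast<const float*>(__builtin_assume_aligned(src, 64));
    for (std::size_t i = 0; i < n; ++i)
        d[i] = s[i] * factor;
}
```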
As Ulrich predicted, every multi-socket system is NUMA these days: integrated memory controllers are standard, i.e. there is no external Northbridge. But SMP no longer means multi-socket, because multi-core CPUs are widespread. Intel CPUs from Nehalem to Skylake have used a large inclusive L3 cache as a backstop for coherency between cores. AMD's CPUs are different, but I'm not as clear on the details.
Skylake-X (AVX512) no longer has an inclusive L3, but I think there's still a tag directory that lets it check what's cached anywhere on the chip (and if so, where) without actually broadcasting snoops to all the cores. SKX uses a mesh rather than a ring bus, with generally even worse latency than previous many-core Xeons, unfortunately.
Basically all the advice about optimizing for memory locality still applies; only the details of exactly what happens when you can't avoid cache misses or contention vary.
6.4.2 Atomic operations: the benchmark showing a CAS-retry loop as 4x worse than hardware-arbitrated lock add probably still reflects a maximum contention case. But in real multi-threaded programs, synchronization is kept to a minimum (because it's expensive), so contention is low and a CAS-retry loop usually succeeds without having to retry.
C++11 std::atomic fetch_add will compile to a lock add (or lock xadd if the return value is used), but an algorithm using CAS to do something that can't be done with a locked instruction is usually not a disaster. Use C++11 std::atomic or C11 stdatomic instead of gcc legacy __sync built-ins or the newer __atomic built-ins, unless you want to mix atomic and non-atomic access to the same location...
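A small sketch contrasting the two idioms (the function names are mine):

```cpp
#include <atomic>

std::atomic<long> counter{0};

void plain_increment() {
    // Compiles to `lock add` on x86 (`lock xadd` if the old value is used).
    counter.fetch_add(1, std::memory_order_relaxed);
}

bool increment_if_below(long limit) {
    // Something no single locked RMW instruction can express, so a CAS
    // retry loop is the natural tool; under low contention it almost
    // always succeeds on the first attempt.
    long old = counter.load(std::memory_order_relaxed);
    while (old < limit) {
        if (counter.compare_exchange_weak(old, old + 1,
                                          std::memory_order_relaxed))
            return true;   // on failure, `old` is reloaded and we retry
    }
    return false;
}
```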
8.1 DWCAS (cmpxchg16b): You can coax gcc into emitting it, but if you want efficient loads of just half the object, you need ugly union hacks: How can I implement ABA counter with c++11 CAS?. (Don't confuse DWCAS with DCAS of two separate memory locations; lock-free atomic emulation of DCAS isn't possible with DWCAS, but transactional memory (like x86 TSX) makes it possible.)
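For reference, a minimal sketch of coaxing a 16-byte CAS out of gcc/clang with the generic __atomic built-ins (compile with -mcx16; depending on compiler version this becomes an inline cmpxchg16b or a libatomic call, and it does not solve the efficient half-load problem the linked Q&A is about):

```cpp
#include <cstdint>

struct alignas(16) PtrAndCount {
    void*         ptr;
    std::uint64_t count;   // ABA counter, bumped on every successful swap
};

bool dwcas(PtrAndCount* dst, PtrAndCount& expected, PtrAndCount desired) {
    // Generic GCC/Clang built-in; operates on the full 16 bytes at once.
    return __atomic_compare_exchange(dst, &expected, &desired,
                                     /*weak=*/false,
                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```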
8.2.4 transactional memory: After a couple false starts (released then disabled by a microcode update because of a rarely-triggered bug), Intel has working transactional memory in late-model Broadwell and all Skylake CPUs. The design is still what David Kanter described for Haswell. There's a lock-elision way to use it to speed up code that uses (and can fall back to) a regular lock (especially with a single lock for all elements of a container so multiple threads in the same critical section often don't collide), or to write code that knows about transactions directly.
Update: Intel has now disabled lock elision via microcode updates on later CPUs (including Skylake). The non-transparent RTM (xbegin / xend) part of TSX can still work if the OS allows it, but TSX in general is seriously turning into Charlie Brown's football.
Has Hardware Lock Elision gone forever due to Spectre Mitigation? (Yes, but because of an MDS-type side-channel vulnerability (TAA), not Spectre. My understanding is that the updated microcode disables HLE entirely; in that case the OS can only enable RTM, not HLE.)
7.5 Hugepages: anonymous transparent hugepages work well on Linux, without having to manually use hugetlbfs. Make allocations >= 2MiB with 2MiB alignment (e.g. posix_memalign, or an aligned_alloc that doesn't enforce the stupid ISO C++17 requirement to fail when size % alignment != 0).
A 2MiB-aligned anonymous allocation will use hugepages by default. Some workloads (e.g. that keep using large allocations for a while after making them) may benefit from echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag to get the kernel to defrag physical memory whenever needed, instead of falling back to 4k pages. (See the kernel docs). Use madvise(MADV_HUGEPAGE) after making large allocations (preferably still with 2MiB alignment) to more strongly encourage the kernel to stop and defrag now. defrag = always is too aggressive for most workloads and will spend more time copying pages around than it saves in TLB misses. (kcompactd could maybe be more efficient.)
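A minimal sketch of that recipe on Linux (the rounding and the helper name are mine; error handling trimmed):

```cpp
#include <stdlib.h>
#include <sys/mman.h>

// 2MiB-aligned, 2MiB-rounded allocation so transparent hugepages can back
// it with 2M pages, plus MADV_HUGEPAGE to ask the kernel to collapse it
// into hugepages sooner rather than later.
void* alloc_huge(size_t bytes) {
    const size_t huge = (size_t)2 * 1024 * 1024;
    size_t rounded = (bytes + huge - 1) & ~(huge - 1);   // round up to 2MiB
    void* p = nullptr;
    if (posix_memalign(&p, huge, rounded) != 0)
        return nullptr;
    madvise(p, rounded, MADV_HUGEPAGE);   // a hint; harmless if THP is disabled
    return p;                             // free() as usual when done
}
```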
BTW, Intel and AMD call 2M pages "large pages", with "huge" reserved for 1G pages. Linux uses "hugepage" for everything larger than the standard size.
(32-bit-mode legacy (non-PAE) page tables only had 4M pages as the next-largest size, with only 2-level page tables and more compact entries. The next size up would have been 4G, but that's the whole address space, and that "level" of translation is the CR3 control register, not a page-directory entry. IDK if that's related to Linux's terminology.)
Appendix B: Oprofile: Linux perf has mostly superseded oprofile. perf list / perf stat -e event1,event2,... has names for most of the useful ways to program HW performance counters.
perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,\
branches,branch-misses,instructions,uops_issued.any,\
uops_executed.thread,idq_uops_not_delivered.core -r2 ./a.out
A few years ago, the ocperf.py wrapper was needed to translate event names into codes; these days perf has that functionality built in.
For some examples of using it, see Can x86's MOV really be "free"? Why can't I reproduce this at all?