Cosmic rays: what is the probability they will affect a program?

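Note: this answer is not about physics, but about silent memory errors in non-ECC memory modules. Some errors may come from outer space, and some from the inner space of the desktop.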

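There are several studies of ECC memory failures on large server farms, such as the CERN clusters and Google datacenters. Server-class hardware with ECC can detect and correct all single-bit errors, and detect many multi-bit errors.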
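To make "correct single-bit errors, detect multi-bit errors" concrete, here is a minimal sketch of a SECDED (single-error-correct, double-error-detect) code. It is a toy extended Hamming (8,4) code protecting 4 data bits, written for illustration only. Real ECC DIMMs commonly protect 64 data bits with 8 check bits, and nothing here is meant to describe any particular memory controller. The function names `encode` and `decode` are my own.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity; 1, 2, 4: Hamming parity
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total number of 1-bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid codeword
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: treat it as one flip and repair it.
        # syndrome == 0 here means the overall parity bit (index 0) flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, cannot repair.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[6] ^= 1                            # simulate a single bit flip
print(decode(word))                     # status and recovered data
word[2] ^= 1                            # a second flip in the same word
print(decode(word))                     # detected, but not correctable
```

On non-ECC memory there is no such check at all, so the same single flip simply changes the data silently.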

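We can assume that there are lots of non-ECC desktops (and non-ECC smartphones). If we check the papers for ECC-correctable error rates (single bit flips), we can estimate the silent memory corruption rate on non-ECC memory.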

The large-scale CERN 2007 study "Data integrity": vendors declare a "Bit Error Rate of 10^-12 for their memory modules", while an "observed error rate is 4 orders of magnitude lower than expected". For data-intensive tasks (8 GB/s of memory reading) this means that a single bit flip may occur roughly every minute (the vendors' 10^-12 BER) or once in two days (10^-16 BER).

Google's 2009 paper "DRAM Errors in the Wild: A Large-Scale Field Study" says that there can be up to 25000-75000 one-bit FIT per Mbit (failures in time per billion hours), which by my calculations equals 1 - 5 bit errors per hour for 8 GB of RAM. The paper says the same: "mean correctable error rates of 2000–6000 per GB per year".

The 2012 Sandia report "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing": "double bit flips were deemed unlikely", but on ORNL's dense Cray XT5 they occur "at a rate of one per day for 75,000+ DIMMs" even with ECC. Single-bit error rates should be higher still.
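The arithmetic behind these figures can be reproduced with a short sketch. The assumptions are mine: decimal units (1 GB = 10^9 bytes, 1 Mbit = 10^6 bits), a BER that applies independently to every bit read, and the hypothetical helper names below. With these assumptions the 10^-16 case comes out at a little under two days, the FIT range at about 1.6 to 4.8 errors per hour for 8 GB, and the 10^-12 case at roughly 16 seconds, so "every minute" is, if anything, on the optimistic side.

```python
GB = 10**9        # assumption: decimal gigabyte
MBIT = 10**6      # assumption: decimal megabit
BITS_PER_BYTE = 8


def seconds_between_errors(ber, bytes_per_second):
    """Mean seconds between bit errors if each bit read fails with probability `ber`."""
    bits_per_second = bytes_per_second * BITS_PER_BYTE
    return 1.0 / (ber * bits_per_second)


def errors_per_hour_from_fit(fit_per_mbit, ram_bytes):
    """Convert FIT per Mbit (failures per 10^9 hours) to errors per hour for `ram_bytes` of RAM."""
    mbits = ram_bytes * BITS_PER_BYTE / MBIT
    return fit_per_mbit * mbits / 1e9


if __name__ == "__main__":
    for ber in (1e-12, 1e-16):
        s = seconds_between_errors(ber, 8 * GB)
        print(f"BER {ber:g}: one flip every {s:.0f} s ({s / 86400:.2f} days)")
    for fit in (25000, 75000):
        per_hour = errors_per_hour_from_fit(fit, 8 * GB)
        per_gb_year = errors_per_hour_from_fit(fit, 1 * GB) * 24 * 365
        print(f"{fit} FIT/Mbit: {per_hour:.1f} errors/hour for 8 GB, "
              f"{per_gb_year:.0f} per GB per year")
```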

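So, if a program has a large dataset (several GB), or has a high memory read or write rate (GB/s or more), and it runs for several hours, then we can expect up to several silent bit flips on desktop hardware. This rate is not detectable by memtest, and the DRAM modules are good.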
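As a rough illustration of "several bit flips in a few hours", the sketch below turns a per-GB-per-year error rate into an expected flip count and a probability of at least one flip during a run. It assumes flips arrive as a Poisson process at a uniform mean rate, which is a simplification: real error rates vary a lot between machines and modules. The function name and the example numbers (4 GB, 6 hours, the lower quoted rate of 2000 per GB per year) are mine.

```python
import math

HOURS_PER_YEAR = 365 * 24


def flip_estimate(errors_per_gb_year, dataset_gb, run_hours):
    """Return (expected flips, probability of at least one flip) for one run.

    Assumes a Poisson process with a uniform mean rate (a simplification).
    """
    rate_per_hour = errors_per_gb_year * dataset_gb / HOURS_PER_YEAR
    expected = rate_per_hour * run_hours
    p_at_least_one = 1.0 - math.exp(-expected)
    return expected, p_at_least_one


if __name__ == "__main__":
    expected, p = flip_estimate(errors_per_gb_year=2000, dataset_gb=4, run_hours=6)
    print(f"expected flips: {expected:.2f}, P(at least one): {p:.3f}")
```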

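Long cluster runs on thousands of non-ECC PCs, such as BOINC's internet-wide grid computing, will always have errors from memory bit flips, and also from silent disk and network errors.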

And for bigger machines (tens of thousands of servers), even with ECC protection from single-bit errors, Sandia's 2012 report shows there can be double-bit flips every day, so you have no chance of running a full-size parallel program for several days without regular checkpointing and restarting from the last good checkpoint after a double-bit error. Huge machines will also get bit flips in their caches and CPU registers (both architectural registers and internal flip-flops, e.g. in the ALU datapath), because not all of them are protected by ECC.
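The checkpoint-and-restart idea mentioned above can be sketched as follows. This is a toy, single-process illustration, not how any real HPC checkpointing library works: the "job" just sums integers, the uncorrectable error is simulated by the hypothetical `fault_at` parameter, and the checkpoint lives in memory with a SHA-256 digest so that a damaged checkpoint would be noticed before use.

```python
import hashlib
import pickle


def save_checkpoint(state):
    """Serialize the state and return it together with a SHA-256 digest."""
    blob = pickle.dumps(state)
    return blob, hashlib.sha256(blob).hexdigest()


def load_checkpoint(blob, digest):
    """Return the saved state, refusing a checkpoint that fails its digest check."""
    if hashlib.sha256(blob).hexdigest() != digest:
        raise RuntimeError("checkpoint is corrupted")
    return pickle.loads(blob)


def run(total_steps, checkpoint_every, fault_at=None):
    """Sum 0..total_steps-1, rolling back once if a fault is reported at step `fault_at`."""
    state = {"step": 0, "acc": 0}
    blob, digest = save_checkpoint(state)
    restarts = 0
    while state["step"] < total_steps:
        if fault_at is not None and state["step"] == fault_at:
            fault_at = None                          # simulated error, reported once
            state = load_checkpoint(blob, digest)    # back to the last good checkpoint
            restarts += 1
            continue
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            blob, digest = save_checkpoint(state)
    return state["acc"], restarts


if __name__ == "__main__":
    print(run(100, 10, fault_at=37))   # same sum as a fault-free run, one restart
```

The work lost per error is bounded by the checkpoint interval, which is why the interval has to shrink as the machine grows and errors become more frequent.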

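PS: it is much worse if the DRAM module is bad. For example, I installed new DRAM in a laptop, and it died several weeks later. It started to give lots of memory errors. What I got: the laptop hangs, Linux reboots, runs fsck, finds errors on the root filesystem and says it wants to reboot after correcting them. But on every subsequent reboot (I did around 5-6 of them) errors were still found on the root filesystem.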

2014-05-11 00:14:57

Apparently, not insignificant. This New Scientist article quotes an Intel patent application:

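"Cosmic ray induced computer crashes have occurred and are expected to increase with frequency as devices (for example, transistors) decrease in size in chips. This problem is projected to become a major limiter of computer reliability in the next decade."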

You can read the full patent here.

2010-04-05 20:26:37