I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable-length records. I've got a first implementation up and running and am now looking toward improving performance, particularly toward doing I/O more efficiently, since the input file gets scanned many times.

Is there a rule of thumb for using mmap() versus reading blocks in via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process the complete records in the buffer, and then read more.

The mmap() code could potentially get very messy because mmap'd blocks need to lie on page-size boundaries (as I understand it), and records could straddle page boundaries. With fstreams, I can simply seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page-size boundaries.

How can I decide between these two options without first writing a complete implementation of each? Are there any rules of thumb (e.g., mmap() is 2x faster) or simple tests?


Current answer

mmap is far faster. You can write a simple benchmark to prove it to yourself:

#include <fstream>

char data[0x1000];
std::ifstream in("file.bin", std::ios::binary);

while (in)
{
  in.read(data, 0x1000);
  std::streamsize n = in.gcount();   // bytes actually read; the last block may be short
  if (n == 0)
    break;
  // do something with the first n bytes of data
}

versus:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

const size_t file_size = something;   // e.g. obtained with fstat()
const size_t page_size = 0x1000;
size_t off = 0;
void *data;

int fd = open("filename.bin", O_RDONLY);

while (off < file_size)
{
  // flags must be MAP_PRIVATE or MAP_SHARED; 0 is not a valid value
  data = mmap(NULL, page_size, PROT_READ, MAP_PRIVATE, fd, off);
  // do stuff with data
  munmap(data, page_size);
  off += page_size;
}

Obviously I left out details (like how to determine when you've reached the end of the file if it isn't a multiple of page_size, for instance), but it really shouldn't be much more complicated than this.
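For what it's worth, that end-of-file detail amounts to asking the kernel for the real file size and clamping the last mapping. A minimal sketch, assuming POSIX fstat()/sysconf(); the function name and error handling are illustrative, not part of the answer above:

#include <algorithm>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void scan_file(const char *path)
{
  int fd = open(path, O_RDONLY);
  if (fd < 0)
    return;

  // Ask the kernel for the real file size and page size instead of guessing.
  struct stat st;
  fstat(fd, &st);
  const size_t file_size = st.st_size;
  const size_t page_size = sysconf(_SC_PAGESIZE);

  for (size_t off = 0; off < file_size; off += page_size)
  {
    // The last chunk may be shorter than a full page.
    const size_t len = std::min(page_size, file_size - off);
    void *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
    if (data == MAP_FAILED)
      break;                       // handle the error as appropriate
    // ... process `len` bytes starting at `data` ...
    munmap(data, len);
  }
  close(fd);
}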

If you can, you might try breaking your data up into multiple files that can be mmap()-ed in whole instead of in part (much simpler).

A couple of months ago I had a half-baked implementation of a sliding-window mmap()-ed stream class for boost_iostreams, but nobody cared and I got busy with other stuff. Most unfortunately, a few weeks ago I deleted an archive of old unfinished projects, and that was one of the victims :-(

Update: I should also add the caveat that this benchmark would look quite different on Windows, because Microsoft implemented a nifty file cache that does most of what you would do with mmap in the first place. That is, for frequently-accessed files, you could just do std::ifstream.read() and it would be as fast as mmap, because the file cache would already have done a memory mapping for you, and it's transparent.

Final Update: Look, people: across a lot of different platform combinations of OS and standard libraries and disks and memory hierarchies, I can't say for certain that the system call mmap, viewed as a black box, will always always always be substantially faster than read. That wasn't exactly my intent, even if my words could be construed that way. Ultimately, my point was that memory-mapped i/o is generally faster than byte-based i/o; this is still true. If you find experimentally that there's no difference between the two, then the only explanation that seems reasonable to me is that your platform implements memory-mapping under the covers in a way that is advantageous to the performance of calls to read. The only way to be absolutely certain that you're using memory-mapped i/o in a portable way is to use mmap. If you don't care about portability and you can rely on the particular characteristics of your target platforms, then using read may be suitable without sacrificing measurably any performance.

Edit to clean up the answer list: @jbl:

The sliding window mmap sounds interesting. Can you say a little more about it?

Sure - I'm writing a C++ library for Git (a libgit++, if you will), and I ran into a similar problem: I needed to be able to open large (very large) files and not have performance be a total dog (as it would be with std::fstream).

Boost::Iostreams already has a mapped_file Source, but the problem was that it was mmapping whole files, which limits you to 2^(wordsize). On 32-bit machines, 4GB isn't big enough. It's not unreasonable to expect to have .pack files in Git that become much larger than that, so I needed to read the file in chunks without resorting to regular file i/o. Under the covers of Boost::Iostreams, I implemented a Source, which is more or less another view of the interaction between std::streambuf and std::istream. You could also try a similar approach by just inheriting std::filebuf into a mapped_filebuf and similarly, inheriting std::fstream into a mapped_fstream. It's the interaction between the two that's difficult to get right. Boost::Iostreams has some of the work done for you, and it also provides hooks for filters and chains, so I thought it would be more useful to implement it that way.
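For reference, a minimal sketch of what such a device could look like. The class name, window size, and bookkeeping are my own illustration of the Boost.Iostreams Source concept (char_type, a category tag, and a read() member), not the author's lost implementation; a production version would also have to cope with being copied (Boost copies devices by value) and with open()/mmap() failures.

#include <algorithm>
#include <cstddef>
#include <cstring>
#include <boost/iostreams/categories.hpp>
#include <boost/iostreams/stream.hpp>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical sliding-window mmap Source: remaps a bounded window of the
// file as the read position advances, so file size is not limited by the
// address space.
class mmap_window_source {
public:
  typedef char char_type;
  typedef boost::iostreams::source_tag category;

  explicit mmap_window_source(const char *path, std::size_t window = 1 << 20)
  {
    fd_   = open(path, O_RDONLY);
    win_  = window;
    pos_  = map_off_ = map_len_ = 0;
    base_ = MAP_FAILED;
    page_ = sysconf(_SC_PAGESIZE);
    struct stat st;
    size_ = (fd_ >= 0 && fstat(fd_, &st) == 0) ? (std::size_t)st.st_size : 0;
  }

  std::streamsize read(char *s, std::streamsize n)
  {
    if (pos_ >= size_)
      return -1;                             // EOF convention for Sources
    remap();                                 // make the window cover pos_
    std::size_t avail = map_off_ + map_len_ - pos_;
    std::size_t count = std::min<std::size_t>(avail, (std::size_t)n);
    std::memcpy(s, (char *)base_ + (pos_ - map_off_), count);
    pos_ += count;
    return (std::streamsize)count;
  }

private:
  void remap()
  {
    if (base_ != MAP_FAILED && pos_ >= map_off_ && pos_ < map_off_ + map_len_)
      return;                                // still inside the current window
    if (base_ != MAP_FAILED)
      munmap(base_, map_len_);
    map_off_ = pos_ - pos_ % page_;          // windows start on page boundaries
    map_len_ = std::min(win_, size_ - map_off_);
    base_    = mmap(NULL, map_len_, PROT_READ, MAP_PRIVATE, fd_, map_off_);
  }

  int fd_;
  std::size_t win_, pos_, size_, page_, map_off_, map_len_;
  void *base_;
};

// Usage: boost::iostreams::stream<mmap_window_source> in("big.pack");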

Other answers

In my opinion, using mmap() "just" unburdens the developer from having to write their own caching code. In a simple "read through the file exactly once" case, that isn't going to be hard (although as mlbrock points out, you still save the memory copy into process space), but if you're going back and forth in the file or skipping bits and so forth, I believe the kernel developers have probably done a better job implementing caching than I can...

This sounds like a good use case for multi-threading... I'd think you could pretty easily set up one thread reading data while the other(s) process it. That may be a way to dramatically increase the perceived performance. Just a thought.
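A minimal sketch of that reader/worker split, assuming C++11 threads and an unbounded queue (a real version would cap the queue and recycle buffers; the file name and block size are illustrative):

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main()
{
  std::queue<std::vector<char> > chunks;
  std::mutex m;
  std::condition_variable cv;
  bool done = false;

  // Reader thread: pull 1 MiB blocks off the disk and hand them to the worker.
  std::thread reader([&] {
    std::ifstream in("file.bin", std::ios::binary);
    while (in) {
      std::vector<char> buf(1 << 20);
      in.read(buf.data(), buf.size());
      buf.resize(in.gcount());               // the last block may be short
      std::lock_guard<std::mutex> lock(m);
      chunks.push(std::move(buf));
      cv.notify_one();
    }
    std::lock_guard<std::mutex> lock(m);
    done = true;
    cv.notify_one();
  });

  // Worker: parse records from each chunk as it becomes available.
  for (;;) {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return !chunks.empty() || done; });
    if (chunks.empty() && done)
      break;
    std::vector<char> buf = std::move(chunks.front());
    chunks.pop();
    lock.unlock();
    // ... process the records contained in buf ...
  }
  reader.join();
}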

I remember mapping a huge file containing a tree structure into memory years ago. I was amazed by the speed compared to normal de-serialization, which involves a lot of work in memory, like allocating tree nodes and setting pointers. So in fact I was comparing a single call to mmap (or its counterpart on Windows) against many (MANY) calls to operator new and constructor calls. For that kind of task, mmap is unbeatable compared to de-serialization. Of course one should look into Boost's relocatable pointer for this.
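The "relocatable pointer" presumably refers to boost::interprocess::offset_ptr, which stores a self-relative offset instead of an absolute address, so a tree image can be mmap()-ed at any base address and used in place. A tiny illustration:

#include <boost/interprocess/offset_ptr.hpp>

// Nodes hold offsets rather than raw addresses, so a tree serialized this
// way is valid wherever the mapping lands in the address space.
struct TreeNode {
  int key;
  boost::interprocess::offset_ptr<TreeNode> left;
  boost::interprocess::offset_ptr<TreeNode> right;
};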

Maybe you should pre-process the files, so that each record is in a separate file (or at least that each file is of an mmap-able size).

Also, could you do all of the processing steps for each record before moving on to the next one? Maybe that would avoid some of the I/O overhead?

I'm sorry Ben Collins lost his sliding-window mmap source code. That would be nice to have in Boost.

Yes, mapping the file is much faster. You're essentially using the OS's virtual memory subsystem to associate memory with disk and vice versa. Think about it this way: if the OS kernel developers could make it faster, they would, because doing so makes just about everything faster: databases, boot times, program load times, and so on.

The sliding-window approach really isn't that difficult, as multiple consecutive pages can be mapped at once. So the size of a record doesn't matter, as long as the largest single record fits into memory. The important thing is managing the bookkeeping.

If a record doesn't begin on a getpagesize() boundary, the mapping has to begin on the previous page. The length of the mapped region extends from the first byte of the record (rounded down, if necessary, to the nearest multiple of getpagesize()) to the last byte of the record (rounded up to the nearest multiple of getpagesize()). When you're finished processing a record, you can munmap() it and move on to the next.
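That bookkeeping boils down to a couple of lines of rounding arithmetic. A sketch, assuming record_offset and record_size come from however you index your records (the helper name is illustrative):

#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Map exactly the pages that cover one record and return a pointer to the
// record itself; the caller later does munmap(*map_base, *map_len).
void *map_record(int fd, size_t record_offset, size_t record_size,
                 void **map_base, size_t *map_len)
{
  const size_t page  = sysconf(_SC_PAGESIZE);
  const size_t start = record_offset - record_offset % page;                      // round down
  const size_t end   = ((record_offset + record_size + page - 1) / page) * page;  // round up
  *map_len  = end - start;
  *map_base = mmap(NULL, *map_len, PROT_READ, MAP_PRIVATE, fd, start);
  if (*map_base == MAP_FAILED)
    return NULL;
  // The record starts part-way into the first mapped page.
  return (char *)*map_base + (record_offset - start);
}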

This works just fine under Windows too, using CreateFileMapping() and MapViewOfFile() (and GetSystemInfo() to get SYSTEM_INFO.dwAllocationGranularity - not SYSTEM_INFO.dwPageSize).
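For completeness, a rough Windows equivalent of the same alignment trick, with error handling omitted (the helper name is illustrative); the key point is that view offsets must be rounded down to dwAllocationGranularity, not just to the page size:

#include <windows.h>

// Map a view that covers one record; the caller later calls
// UnmapViewOfFile(*viewBase).
void *MapRecord(HANDLE hFile, unsigned long long recordOffset, SIZE_T recordSize,
                void **viewBase, SIZE_T *viewLen)
{
  SYSTEM_INFO si;
  GetSystemInfo(&si);
  const unsigned long long gran  = si.dwAllocationGranularity;
  const unsigned long long start = recordOffset - recordOffset % gran;

  HANDLE hMap = CreateFileMapping(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
  *viewLen  = (SIZE_T)(recordOffset + recordSize - start);
  *viewBase = MapViewOfFile(hMap, FILE_MAP_READ,
                            (DWORD)(start >> 32), (DWORD)(start & 0xFFFFFFFF),
                            *viewLen);
  CloseHandle(hMap);          // the view keeps the mapping object alive
  if (*viewBase == NULL)
    return NULL;
  return (char *)*viewBase + (SIZE_T)(recordOffset - start);
}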