Here I'm going to go against the general wisdom that std::copy will have a slight, almost imperceptible performance loss. I just ran a test and found this to be untrue: I did notice a performance difference. However, the winner was std::copy.
I wrote a C++ SHA-2 implementation. In my test, I hash 5 strings using all four SHA-2 versions (224, 256, 384, 512), and I loop 300 times. I measure times using Boost.timer. That loop count of 300 is enough to completely stabilize my results. I ran the test 5 times each, alternating between the memcpy version and the std::copy version. My code takes advantage of grabbing data in as large chunks as possible (many other implementations operate on char / char *, whereas I operate on T / T *, where T is the largest type in the user's implementation that has correct overflow behavior), so fast memory access on the largest types I can use is central to the performance of my algorithm. These are my results:
Time (in seconds) to complete run of SHA-2 tests
std::copy memcpy % increase
6.11 6.29 2.86%
6.09 6.28 3.03%
6.10 6.29 3.02%
6.08 6.27 3.03%
6.08 6.27 3.03%
std::copy gave an average speed increase of 2.99% over memcpy
My compiler is gcc 4.6.3 on Fedora 16 x86_64. My optimization flags are -Ofast -march=native -funsafe-loop-optimizations.
Code for my SHA-2 implementation.
I decided to run a test on my MD5 implementation as well. The results were much less stable, so I decided to do 10 runs. However, after my first few attempts, I got results that varied wildly from one run to the next, so I suspect there was some sort of OS activity going on. I decided to start over.
Same compiler settings and flags. There is only one version of MD5, and it's faster than SHA-2, so I did 3000 loops on a similar set of 5 test strings.
These were my final 10 results:
Time (in seconds) to complete run of MD5 tests
std::copy memcpy % difference
5.52 5.56 +0.72%
5.56 5.55 -0.18%
5.57 5.53 -0.72%
5.57 5.52 -0.91%
5.56 5.57 +0.18%
5.56 5.57 +0.18%
5.56 5.53 -0.54%
5.53 5.57 +0.72%
5.59 5.57 -0.36%
5.57 5.56 -0.18%
std::copy was slower than memcpy by an average of 0.11%
Code for my MD5 implementation
These results suggest that there is some optimization that std::copy was able to use in my SHA-2 tests but could not use in my MD5 tests. In the SHA-2 tests, both arrays were created in the same function that called std::copy / memcpy. In my MD5 tests, one of the arrays was passed in to the function as a function parameter.
I did a little bit more testing to see what I could do to make std::copy faster again. The answer turned out to be simple: turn on link-time optimization. These are my results with LTO turned on (option -flto in gcc):
Time (in seconds) to complete run of MD5 tests with -flto
std::copy memcpy % difference
5.54 5.57 +0.54%
5.50 5.53 +0.54%
5.54 5.58 +0.72%
5.50 5.57 +1.26%
5.54 5.58 +0.72%
5.54 5.57 +0.54%
5.54 5.56 +0.36%
5.54 5.58 +0.72%
5.51 5.58 +1.25%
5.54 5.57 +0.54%
std::copy gave an average speed increase of 0.72% over memcpy
In summary, there does not appear to be a performance penalty for using std::copy. In fact, there appears to be a performance gain.
Explanation of results
So why might std::copy give a performance boost?
First, I would not expect it to be slower for any implementation, as long as the optimization of inlining is turned on. All compilers inline aggressively; it is possibly the most important optimization because it enables so many other optimizations. std::copy can (and I suspect all real world implementations do) detect that the arguments are trivially copyable and that memory is laid out sequentially. This means that in the worst case, when memcpy is legal, std::copy should perform no worse. The trivial implementation of std::copy that defers to memcpy should meet your compiler's criteria of "always inline this when optimizing for speed or size".
However, std::copy also keeps more of its information. When you call std::copy, the function keeps the types intact. memcpy operates on void *, which discards almost all useful information. For instance, if I pass in an array of std::uint64_t, the compiler or library implementer may be able to take advantage of 64-bit alignment with std::copy, but it may be more difficult to do so with memcpy. Many implementations of algorithms like this work by first working on the unaligned portion at the start of the range, then the aligned portion, then the unaligned portion at the end. If it is all guaranteed to be aligned, then the code becomes simpler and faster, and easier for the branch predictor in your processor to get correct.
Premature optimization?
std::copy is in an interesting position. I expect it to never be slower than memcpy, and sometimes faster, with any modern optimizing compiler. Moreover, anything that you can memcpy, you can std::copy. memcpy does not allow any overlap in the buffers, whereas std::copy supports overlap in one direction (with std::copy_backward for the other direction of overlap). memcpy works only on pointers, while std::copy works on any iterators (std::map, std::vector, std::deque, or my own custom type). In other words, you should just use std::copy when you need to copy chunks of data around.