For performance, is it better to use memcpy or std::copy()? Why?

char *bits = NULL;
...

bits = new (std::nothrow) char[((int *) copyMe->bits)[0]];
if (bits == NULL)
{
    cout << "ERROR Not enough memory.\n";
    exit(1);
}

memcpy (bits, copyMe->bits, ((int *) copyMe->bits)[0]);

Current answer

Just a small addition: the speed difference between memcpy() and std::copy() can vary quite a bit depending on whether optimizations are enabled. With g++ 6.2.0 and no optimizations, memcpy() clearly wins:

Benchmark             Time           CPU Iterations
---------------------------------------------------
bm_memcpy            17 ns         17 ns   40867738
bm_stdcopy           62 ns         62 ns   11176219
bm_stdcopy_n         72 ns         72 ns    9481749

With optimizations enabled (-O3), everything looks roughly the same:

Benchmark             Time           CPU Iterations
---------------------------------------------------
bm_memcpy             3 ns          3 ns  274527617
bm_stdcopy            3 ns          3 ns  272663990
bm_stdcopy_n          3 ns          3 ns  274732792

The larger the array, the less noticeable the effect becomes, but even at N=1000, memcpy() was still roughly twice as fast when optimizations were disabled.

Source code (requires Google Benchmark):

#include <string.h>
#include <algorithm>
#include <vector>
#include <benchmark/benchmark.h>

constexpr int N = 10;

void bm_memcpy(benchmark::State& state)
{
  std::vector<int> a(N);
  std::vector<int> r(N);

  while (state.KeepRunning())
  {
    memcpy(r.data(), a.data(), N * sizeof(int));
  }
}

void bm_stdcopy(benchmark::State& state)
{
  std::vector<int> a(N);
  std::vector<int> r(N);

  while (state.KeepRunning())
  {
    std::copy(a.begin(), a.end(), r.begin());
  }
}

void bm_stdcopy_n(benchmark::State& state)
{
  std::vector<int> a(N);
  std::vector<int> r(N);

  while (state.KeepRunning())
  {
    std::copy_n(a.begin(), N, r.begin());
  }
}

BENCHMARK(bm_memcpy);
BENCHMARK(bm_stdcopy);
BENCHMARK(bm_stdcopy_n);

BENCHMARK_MAIN();

/* EOF */

Other answers

Every compiler I know will replace a simple std::copy with memcpy when appropriate, or, better still, vectorize the copy so that it is even faster than memcpy.

In any case: profile and find out for yourself. Different compilers do different things, and it is quite possible they won't do exactly what you expect.

See this introduction to compiler optimization (pdf).

Here is what GCC does for a simple std::copy of a POD type.

#include <algorithm>

struct foo
{
  int x, y;    
};

void bar(foo* a, foo* b, size_t n)
{
  std::copy(a, a + n, b);
}

Here is the disassembly (with only -O optimization), showing the call to memmove:

bar(foo*, foo*, unsigned long):
    salq    $3, %rdx
    sarq    $3, %rdx
    testq   %rdx, %rdx
    je  .L5
    subq    $8, %rsp
    movq    %rsi, %rax
    salq    $3, %rdx
    movq    %rdi, %rsi
    movq    %rax, %rdi
    call    memmove
    addq    $8, %rsp
.L5:
    rep
    ret

If you change the function signature to

void bar(foo* __restrict a, foo* __restrict b, size_t n)

then the memmove becomes a memcpy, for a slight performance improvement. Note that memcpy itself will be heavily vectorized.

In theory, memcpy might have a slight, imperceptible, infinitesimal performance advantage, simply because it does not have the same requirements as std::copy. From the man page for memcpy:

To avoid overflows, the size of the arrays pointed to by both the destination and source parameters shall be at least num bytes, and should not overlap (for overlapping memory blocks, memmove is a safer approach).

In other words, memcpy can ignore the possibility of overlapping data. (Passing overlapping arrays to memcpy is undefined behavior.) So memcpy does not need to explicitly check for this condition, whereas std::copy can be used as long as the OutputIterator argument does not lie within the source range. Note that this is not the same as saying the source and destination ranges cannot overlap.

So since std::copy has somewhat different requirements, in theory it should be slightly (with heavy emphasis on slightly) slower, because it will probably either check for overlapping C-arrays, or delegate the copying of C-arrays to memmove, which needs to perform the check. But in practice, you (and most profilers) probably won't detect any difference.

Of course, if you're not working with PODs, you can't use memcpy anyway.

My rule is simple. If you are using C++, prefer the C++ libraries over C :)

Always use std::copy, because memcpy is limited to C-style POD structures, and the compiler will likely replace calls to std::copy with memcpy anyway if the targets are in fact POD.

Also, std::copy can be used with many iterator types, not just pointers. std::copy is more flexible, has no performance loss, and is the clear winner.

Here I'm going to go against the general wisdom that std::copy will have a slight, almost imperceptible performance loss. I just ran a test and found that to be untrue: I did notice a performance difference. However, the winner was std::copy.

I wrote a C++ SHA-2 implementation. In my test, I hash 5 strings using all four SHA-2 versions (224, 256, 384, 512), and I loop 300 times. I measure times using Boost.timer. That 300 loop counter is enough to completely stabilize my results. I ran the test 5 times each, alternating between the memcpy version and the std::copy version. My code takes advantage of grabbing data in as large of chunks as possible (many other implementations operate with char / char *, whereas I operate with T / T *, where T is the largest type in the user's implementation that has correct overflow behavior), so fast memory access on the largest types I can manage is central to the performance of my algorithm. These are my results:

Time (in seconds) to complete the SHA-2 test runs

std::copy   memcpy  % increase
6.11        6.29    2.86%
6.09        6.28    3.03%
6.10        6.29    3.02%
6.08        6.27    3.03%
6.08        6.27    3.03%

Average speedup of std::copy over memcpy: 2.99%

My compiler is gcc 4.6.3 on Fedora 16 x86_64. My optimization flags are -Ofast -march=native -funsafe-loop-optimizations.

Code for my SHA-2 implementation.

I decided to run a test on my MD5 implementation as well. The results were much less stable, so I decided to do 10 runs. However, after my first few attempts, the results varied wildly from one run to the next, so I suspected some sort of OS activity was going on. I decided to start over.

Same compiler settings and flags. There is only one version of MD5, and it's faster than SHA-2, so I did 3000 loops on a similar set of 5 test strings.

These are my final 10 results:

Time (in seconds) to complete the MD5 test runs

std::copy   memcpy      % difference
5.52        5.56        +0.72%
5.56        5.55        -0.18%
5.57        5.53        -0.72%
5.57        5.52        -0.91%
5.56        5.57        +0.18%
5.56        5.57        +0.18%
5.56        5.53        -0.54%
5.53        5.57        +0.72%
5.59        5.57        -0.36%
5.57        5.56        -0.18%

Average slowdown of std::copy over memcpy: 0.11%

Code for my MD5 implementation

These results suggest that there is some optimization that std::copy was able to use in my SHA-2 tests that it could not use in my MD5 tests. In the SHA-2 tests, both arrays were created in the same function that made the std::copy / memcpy call. In my MD5 tests, one of the arrays was passed in to the function as a function parameter.

I did a little more testing to see what I could do to make std::copy faster again. The answer turned out to be simple: turn on link-time optimization. These are my results with LTO turned on (option -flto in gcc):

Time (in seconds) to complete the MD5 test runs with -flto

std::copy   memcpy      % difference
5.54        5.57        +0.54%
5.50        5.53        +0.54%
5.54        5.58        +0.72%
5.50        5.57        +1.26%
5.54        5.58        +0.72%
5.54        5.57        +0.54%
5.54        5.56        +0.36%
5.54        5.58        +0.72%
5.51        5.58        +1.25%
5.54        5.57        +0.54%

Average speedup of std::copy over memcpy with -flto: 0.72%

In summary, there does not appear to be a performance penalty for using std::copy. In fact, there appears to be a performance gain.

Explanation of the results

So why might std::copy give a performance boost?

First, I would not expect it to be slower for any implementation, as long as the optimization of inlining is turned on. All compilers inline aggressively; it is possibly the most important optimization because it enables so many other optimizations. std::copy can (and I suspect all real world implementations do) detect that the arguments are trivially copyable and that memory is laid out sequentially. This means that in the worst case, when memcpy is legal, std::copy should perform no worse. The trivial implementation of std::copy that defers to memcpy should meet your compiler's criteria of "always inline this when optimizing for speed or size".

However, std::copy also keeps more of its information. When you call std::copy, the function keeps the types intact. memcpy operates on void *, which discards almost all useful information. For instance, if I pass in an array of std::uint64_t, the compiler or library implementer may be able to take advantage of 64-bit alignment with std::copy, but it may be more difficult to do so with memcpy. Many implementations of algorithms like this work by first working on the unaligned portion at the start of the range, then the aligned portion, then the unaligned portion at the end. If it is all guaranteed to be aligned, then the code becomes simpler and faster, and easier for the branch predictor in your processor to get correct.

Premature optimization?

std::copy is in an interesting position. I expect it to never be slower than memcpy and sometimes faster with any modern optimizing compiler. Moreover, anything that you can memcpy, you can std::copy. memcpy does not allow any overlap in the buffers, whereas std::copy supports overlap in one direction (with std::copy_backward for the other direction of overlap). memcpy only works on pointers, std::copy works on any iterators (std::map, std::vector, std::deque, or my own custom type). In other words, you should just use std::copy when you need to copy chunks of data around.