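I've always taken it as general wisdom that std::vector is "implemented as an array," blah blah blah. Today I went and tested it, and that doesn't seem to be the case:
Here are some test results: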
UseArray completed in 2.619 seconds
UseVector completed in 9.284 seconds
UseVectorPushBack completed in 14.669 seconds
The whole thing completed in 26.591 seconds
That's about 3 - 4 times slower! It doesn't really justify the "vector may be slower by a few nanoseconds" comments.
And the code I used:
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>
class TestTimer
{
public:
    TestTimer(const std::string & name) : name(name),
        start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
    {
    }
    ~TestTimer()
    {
        using namespace std;
        using namespace boost;
        posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
        posix_time::time_duration d = now - start;
        cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
            " seconds" << endl;
    }
private:
    std::string name;
    boost::posix_time::ptime start;
};
struct Pixel
{
    Pixel()
    {
    }
    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
    {
    }
    unsigned char r, g, b;
};
void UseVector()
{
    TestTimer t("UseVector");
    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;
        std::vector<Pixel> pixels;
        pixels.resize(dimension * dimension);
        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}
void UseVectorPushBack()
{
    TestTimer t("UseVectorPushBack");
    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;
        std::vector<Pixel> pixels;
        pixels.reserve(dimension * dimension);
        for(int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}
void UseArray()
{
    TestTimer t("UseArray");
    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;
        Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);
        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
        free(pixels);
    }
}
int main()
{
    TestTimer t1("The whole thing");
    UseArray();
    UseVector();
    UseVectorPushBack();
    return 0;
}
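Am I doing something wrong? Or have I just busted this performance myth?
I'm using Release mode in Visual Studio 2005.
In Visual C++, #define _SECURE_SCL 0 cuts the UseVector time in half (bringing it down to 4 seconds). That is really huge, IMO.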
Some profiler data (Pixel is aligned to 32 bits):
g++ -msse3 -O3 -ftree-vectorize -g test.cpp -DNDEBUG && ./a.out
UseVector completed in 3.123 seconds
UseArray completed in 1.847 seconds
UseVectorPushBack completed in 9.186 seconds
The whole thing completed in 14.159 seconds
Blah
andrey@nv:~$ opannotate --source libcchem/src/a.out | grep "Total samples for file" -A3
Overflow stats not available
* Total samples for file : "/usr/include/c++/4.4/ext/new_allocator.h"
*
* 141008 52.5367
*/
--
* Total samples for file : "/home/andrey/libcchem/src/test.cpp"
*
* 61556 22.9345
*/
--
* Total samples for file : "/usr/include/c++/4.4/bits/stl_vector.h"
*
* 41956 15.6320
*/
--
* Total samples for file : "/usr/include/c++/4.4/bits/stl_uninitialized.h"
*
* 20956 7.8078
*/
--
* Total samples for file : "/usr/include/c++/4.4/bits/stl_construct.h"
*
* 2923 1.0891
*/
In the allocator:
: // _GLIBCXX_RESOLVE_LIB_DEFECTS
: // 402. wrong new expression in [some_] allocator::construct
: void
: construct(pointer __p, const _Tp& __val)
141008 52.5367 : { ::new((void *)__p) _Tp(__val); }
The vector:
:void UseVector()
:{ /* UseVector() total: 60121 22.3999 */
...
:
:
10790 4.0201 : for (int i = 0; i < dimension * dimension; ++i) {
:
495 0.1844 : pixels[i].r = 255;
:
12618 4.7012 : pixels[i].g = 0;
:
2253 0.8394 : pixels[i].b = 0;
:
: }
The array:
:void UseArray()
:{ /* UseArray() total: 35191 13.1114 */
:
...
:
136 0.0507 : for (int i = 0; i < dimension * dimension; ++i) {
:
9897 3.6874 : pixels[i].r = 255;
:
3511 1.3081 : pixels[i].g = 0;
:
21647 8.0652 : pixels[i].b = 0;
Most of the overhead is in the copy constructor. For example,
std::vector < Pixel > pixels;//(dimension * dimension, Pixel());
pixels.reserve(dimension * dimension);
for (int i = 0; i < dimension * dimension; ++i) {
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
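has the same performance as the array. (Note that this snippet only reserves capacity and never changes the vector's size, so indexing it with operator[] is undefined behavior. It shows where the time goes; it is not a usable fix.)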
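To make the construction cost visible without a profiler, here is a minimal sketch that counts constructor calls. CountedPixel and the two helper functions are hypothetical names introduced for illustration; they are not part of the benchmark above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical instrumented stand-in for Pixel: every default or copy
// construction bumps a global counter.
struct CountedPixel
{
    static std::size_t constructions;
    unsigned char r, g, b;
    CountedPixel() : r(0), g(0), b(0) { ++constructions; }
    CountedPixel(const CountedPixel & o) : r(o.r), g(o.g), b(o.b) { ++constructions; }
};
std::size_t CountedPixel::constructions = 0;

// resize(n) has to construct all n elements before the caller touches them.
std::size_t constructions_after_resize(std::size_t n)
{
    CountedPixel::constructions = 0;
    std::vector<CountedPixel> v;
    v.resize(n);
    return CountedPixel::constructions;
}

// reserve(n) only allocates raw storage; size() stays 0 and nothing is built.
std::size_t constructions_after_reserve(std::size_t n)
{
    CountedPixel::constructions = 0;
    std::vector<CountedPixel> v;
    v.reserve(n);
    return CountedPixel::constructions;
}
```

With resize the count is at least n (pre-C++11 libraries build one prototype and copy it n times), while reserve alone constructs nothing. That per-element construction is work the malloc-based loop never does.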
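Martin York's answer bothers me because it reads like an attempt to brush the initialisation problem under the carpet. But he is right to identify redundant default construction as the source of the performance problem.
[EDIT: Martin's answer no longer suggests changing the default constructor.]
For the immediate problem at hand, you could certainly call the 2-parameter version of the vector<Pixel> ctor instead: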
std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
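That works if you want to initialise with a constant value, which is a common case. But the more general problem is: how can you efficiently initialise with something more complicated than a constant value?
For this you can use a back_insert_iterator, which is an iterator adaptor. Here's an example with a vector of ints, although the general idea works just as well for Pixels: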
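#include <algorithm>  // std::generate_n
#include <iterator>   // std::back_inserter
#include <vector>
// Simple functor that returns successive squares: 1, 4, 9, 16...
struct squares {
    squares() : i(0) {}
    // Not const: each call advances the internal counter.
    int operator()() { ++i; return i * i; }
private:
    int i;
};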
...
std::vector<int> v;
v.reserve(someSize); // To make insertions efficient
std::generate_n(std::back_inserter(v), someSize, squares());
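Alternatively, you could use copy() or transform() instead of generate_n().
The downside is that the logic that constructs the initial values has to move into a separate class, which is less convenient than having it in place (although lambdas in C++11 make this much nicer). I also expect this still won't be quite as fast as a malloc()-based non-STL version, but it should be close, since it performs only one construction per element.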
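As a sketch of the lambda variant mentioned above, applied to Pixel rather than int: the helper name make_gradient and the gradient pattern are assumptions made for illustration, and the Pixel struct is repeated (with a zero-initialising default constructor) only to keep the snippet self-contained. It needs C++11 or later.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

struct Pixel
{
    Pixel() : r(0), g(0), b(0) {}
    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b) {}
    unsigned char r, g, b;
};

// Hypothetical helper: fills a vector with a repeating red ramp (0..255),
// constructing each element exactly once, in place of a separate functor class.
std::vector<Pixel> make_gradient(std::size_t count)
{
    std::vector<Pixel> pixels;
    pixels.reserve(count);  // one allocation up front, no reallocation later
    std::size_t i = 0;
    std::generate_n(std::back_inserter(pixels), count, [&i]() {
        return Pixel(static_cast<unsigned char>(i++ % 256), 0, 0);
    });
    return pixels;
}
```

The lambda keeps the value-generating logic next to the call site, which addresses the inconvenience of the separate squares-style class.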
It seems to depend on the compiler flags. Here is some benchmark code:
#include <chrono>
#include <cmath>
#include <cstdlib>  // srand, rand
#include <ctime>
#include <iostream>
#include <vector>
int main(){
    int size = 1000000; // reduce this number in case your program crashes
    int L = 10;
    std::cout << "size=" << size << " L=" << L << std::endl;
    {
        srand( time(0) );
        double * data = new double[size];
        double result = 0.;
        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
        for( int l = 0; l < L; l++ ) {
            for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
            for( int i = 0; i < size; i++ ) result += data[i] * data[i];
        }
        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "Calculation result is " << sqrt(result) << "\n";
        std::cout << "Duration of C style heap array: " << duration << "ms\n";
        delete[] data; // memory obtained with new[] must be released with delete[]
    }
    {
        srand( 1 + time(0) );
        double data[size]; // variable-length array: a compiler extension, not standard C++
        double result = 0.;
        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
        for( int l = 0; l < L; l++ ) {
            for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
            for( int i = 0; i < size; i++ ) result += data[i] * data[i];
        }
        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "Calculation result is " << sqrt(result) << "\n";
        std::cout << "Duration of C99 style stack array: " << duration << "ms\n";
    }
    {
        srand( 2 + time(0) );
        std::vector<double> data( size );
        double result = 0.;
        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
        for( int l = 0; l < L; l++ ) {
            for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
            for( int i = 0; i < size; i++ ) result += data[i] * data[i];
        }
        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "Calculation result is " << sqrt(result) << "\n";
        std::cout << "Duration of std::vector array: " << duration << "ms\n";
    }
    return 0;
}
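Different optimization flags give different answers (note that the program labels the durations "ms", but duration_cast<std::chrono::microseconds> means the figures are actually microseconds):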
$ g++ -O0 benchmark.cpp
$ ./a.out
size=1000000 L=10
Calculation result is 181182
Duration of C style heap array: 118441ms
Calculation result is 181240
Duration of C99 style stack array: 104920ms
Calculation result is 181210
Duration of std::vector array: 124477ms
$ g++ -O3 benchmark.cpp
$ ./a.out
size=1000000 L=10
Calculation result is 181213
Duration of C style heap array: 107803ms
Calculation result is 181198
Duration of C99 style stack array: 87247ms
Calculation result is 181204
Duration of std::vector array: 89083ms
$ g++ -Ofast benchmark.cpp
$ ./a.out
size=1000000 L=10
Calculation result is 181164
Duration of C style heap array: 93530ms
Calculation result is 181179
Duration of C99 style stack array: 80620ms
Calculation result is 181191
Duration of std::vector array: 78830ms
Your exact results will vary, but this is quite typical on my machine.
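Since the three timed blocks repeat the same start/stop boilerplate, one way to reduce copy-paste mistakes (such as a mislabelled unit) is to factor the timing into a helper. The following is only a sketch; elapsed_microseconds and sum_of_squares are hypothetical names, not part of the benchmark above.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in microseconds, so the unit is fixed in one place.
template <typename Work>
long long elapsed_microseconds(Work work)
{
    const std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    work();
    const std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}

// The same inner computation as the benchmark's second loop.
double sum_of_squares(const std::vector<double> & data)
{
    double result = 0.;
    for (std::size_t i = 0; i < data.size(); ++i) result += data[i] * data[i];
    return result;
}
```

A caller would pass a lambda, for example `elapsed_microseconds([&]{ result = sum_of_squares(data); })`, and print the returned value with a "us" label.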