我一直认为std::vector是“作为数组实现的”,等等等等。今天我去测试了一下,结果似乎不是这样:

以下是一些测试结果:

UseArray completed in 2.619 seconds
UseVector completed in 9.284 seconds
UseVectorPushBack completed in 14.669 seconds
The whole thing completed in 26.591 seconds

这大约要慢3 - 4倍!这并不能证明“向量可能会慢几纳秒”的评论是正确的。

我使用的代码是:

#include <cstdlib>
#include <vector>

#include <iostream>
#include <string>

#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
    public:
        TestTimer(const std::string & name) : name(name),
            start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
        {
        }

        ~TestTimer()
        {
            using namespace std;
            using namespace boost;

            posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
            posix_time::time_duration d = now - start;

            cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
                " seconds" << endl;
        }

    private:
        std::string name;
        boost::posix_time::ptime start;
};

struct Pixel
{
    Pixel()
    {
    }

    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
    {
    }

    unsigned char r, g, b;
};

void UseVector()
{
    TestTimer t("UseVector");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.resize(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

void UseVectorPushBack()
{
    TestTimer t("UseVectorPushBack");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
            pixels.reserve(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}

void UseArray()
{
    TestTimer t("UseArray");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);

        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }

        free(pixels);
    }
}

int main()
{
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();
    UseVectorPushBack();

    return 0;
}

我做错了吗?还是我刚刚打破了这个性能神话?

我使用Visual Studio 2005中的发布模式。


在Visual c++中,#define _SECURE_SCL 0将UseVector减少了一半(减少到4秒)。在我看来,这真的是件大事。


当前回答

这似乎取决于编译器标志。下面是一个基准代码:

#include <chrono>
#include <cmath>
#include <ctime>
#include <iostream>
#include <vector>


int main(){

    int size = 1000000; // reduce this number in case your program crashes
    int L = 10;

    std::cout << "size=" << size << " L=" << L << std::endl;
    {
        srand( time(0) );
        double * data = new double[size];
        double result = 0.;
        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
        for( int l = 0; l < L; l++ ) {
            for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
            for( int i = 0; i < size; i++ ) result += data[i] * data[i];
        }
        std::chrono::steady_clock::time_point end   = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "Calculation result is " << sqrt(result) << "\n";
        std::cout << "Duration of C style heap array:    " << duration << "ms\n";
        delete data;
    }

    {
        srand( 1 + time(0) );
        double data[size]; // technically, non-compliant with C++ standard.
        double result = 0.;
        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
        for( int l = 0; l < L; l++ ) {
            for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
            for( int i = 0; i < size; i++ ) result += data[i] * data[i];
        }
        std::chrono::steady_clock::time_point end   = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "Calculation result is " << sqrt(result) << "\n";
        std::cout << "Duration of C99 style stack array: " << duration << "ms\n";
    }

    {
        srand( 2 + time(0) );
        std::vector<double> data( size );
        double result = 0.;
        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
        for( int l = 0; l < L; l++ ) {
            for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
            for( int i = 0; i < size; i++ ) result += data[i] * data[i];
        }
        std::chrono::steady_clock::time_point end   = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "Calculation result is " << sqrt(result) << "\n";
        std::cout << "Duration of std::vector array:     " << duration << "ms\n";
    }

    return 0;
}

不同的优化标志给出不同的答案:

$ g++ -O0 benchmark.cpp 
$ ./a.out 
size=1000000 L=10
Calculation result is 181182
Duration of C style heap array:    118441ms
Calculation result is 181240
Duration of C99 style stack array: 104920ms
Calculation result is 181210
Duration of std::vector array:     124477ms
$g++ -O3 benchmark.cpp
$ ./a.out 
size=1000000 L=10
Calculation result is 181213
Duration of C style heap array:    107803ms
Calculation result is 181198
Duration of C99 style stack array: 87247ms
Calculation result is 181204
Duration of std::vector array:     89083ms
$ g++ -Ofast benchmark.cpp 
$ ./a.out 
size=1000000 L=10
Calculation result is 181164
Duration of C style heap array:    93530ms
Calculation result is 181179
Duration of C99 style stack array: 80620ms
Calculation result is 181191
Duration of std::vector array:     78830ms

您的确切结果会有所不同,但这在我的机器上是非常典型的。

其他回答

顺便说一下,你在使用vector的类中看到的减速也发生在标准类型中,比如int。这是一个多线程代码:

#include <iostream>
#include <cstdio>
#include <map>
#include <string>
#include <typeinfo>
#include <vector>
#include <pthread.h>
#include <sstream>
#include <fstream>
using namespace std;

//pthread_mutex_t map_mutex=PTHREAD_MUTEX_INITIALIZER;

long long num=500000000;
int procs=1;

struct iterate
{
    int id;
    int num;
    void * member;
    iterate(int a, int b, void *c) : id(a), num(b), member(c) {}
};

//fill out viterate and piterate
void * viterate(void * input)
{
    printf("am in viterate\n");
    iterate * info=static_cast<iterate *> (input);
    // reproduce member type
    vector<int> test= *static_cast<vector<int>*> (info->member);
    for (int i=info->id; i<test.size(); i+=info->num)
    {
        //printf("am in viterate loop\n");
        test[i];
    }
    pthread_exit(NULL);
}

void * piterate(void * input)
{
    printf("am in piterate\n");
    iterate * info=static_cast<iterate *> (input);;
    int * test=static_cast<int *> (info->member);
    for (int i=info->id; i<num; i+=info->num) {
        //printf("am in piterate loop\n");
        test[i];
    }
    pthread_exit(NULL);
}

int main()
{
    cout<<"producing vector of size "<<num<<endl;
    vector<int> vtest(num);
    cout<<"produced  a vector of size "<<vtest.size()<<endl;
    pthread_t thread[procs];

    iterate** it=new iterate*[procs];
    int ans;
    void *status;

    cout<<"begining to thread through the vector\n";
    for (int i=0; i<procs; i++) {
        it[i]=new iterate(i, procs, (void *) &vtest);
    //  ans=pthread_create(&thread[i],NULL,viterate, (void *) it[i]);
    }
    for (int i=0; i<procs; i++) {
        pthread_join(thread[i], &status);
    }
    cout<<"end of threading through the vector";
    //reuse the iterate structures

    cout<<"producing a pointer with size "<<num<<endl;
    int * pint=new int[num];
    cout<<"produced a pointer with size "<<num<<endl;

    cout<<"begining to thread through the pointer\n";
    for (int i=0; i<procs; i++) {
        it[i]->member=&pint;
        ans=pthread_create(&thread[i], NULL, piterate, (void*) it[i]);
    }
    for (int i=0; i<procs; i++) {
        pthread_join(thread[i], &status);
    }
    cout<<"end of threading through the pointer\n";

    //delete structure array for iterate
    for (int i=0; i<procs; i++) {
        delete it[i];
    }
    delete [] it;

    //delete pointer
    delete [] pint;

    cout<<"end of the program"<<endl;
    return 0;
}

代码中的行为表明vector的实例化是代码中最长的部分。一旦你通过瓶颈。其余的代码运行得非常快。无论在多少个线程上运行,这都是正确的。

顺便说一下,忽略那些疯狂的包含数。我一直在使用这段代码来测试一个项目的东西,所以包含的数量不断增长。

一些分析器数据(像素对齐为32位):

g++ -msse3 -O3 -ftree-vectorize -g test.cpp -DNDEBUG && ./a.out
UseVector completed in 3.123 seconds
UseArray completed in 1.847 seconds
UseVectorPushBack completed in 9.186 seconds
The whole thing completed in 14.159 seconds

Blah

andrey@nv:~$ opannotate --source libcchem/src/a.out  | grep "Total samples for file" -A3
Overflow stats not available
 * Total samples for file : "/usr/include/c++/4.4/ext/new_allocator.h"
 *
 * 141008 52.5367
 */
--
 * Total samples for file : "/home/andrey/libcchem/src/test.cpp"
 *
 *  61556 22.9345
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_vector.h"
 *
 *  41956 15.6320
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_uninitialized.h"
 *
 *  20956  7.8078
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_construct.h"
 *
 *   2923  1.0891
 */

在分配器:

               :      // _GLIBCXX_RESOLVE_LIB_DEFECTS
               :      // 402. wrong new expression in [some_] allocator::construct
               :      void
               :      construct(pointer __p, const _Tp& __val)
141008 52.5367 :      { ::new((void *)__p) _Tp(__val); }

向量:

               :void UseVector()
               :{ /* UseVector() total:  60121 22.3999 */
...
               :
               :
 10790  4.0201 :        for (int i = 0; i < dimension * dimension; ++i) {
               :
   495  0.1844 :            pixels[i].r = 255;
               :
 12618  4.7012 :            pixels[i].g = 0;
               :
  2253  0.8394 :            pixels[i].b = 0;
               :
               :        }

数组

               :void UseArray()
               :{ /* UseArray() total:  35191 13.1114 */
               :
...
               :
   136  0.0507 :        for (int i = 0; i < dimension * dimension; ++i) {
               :
  9897  3.6874 :            pixels[i].r = 255;
               :
  3511  1.3081 :            pixels[i].g = 0;
               :
 21647  8.0652 :            pixels[i].b = 0;

大部分开销都在复制构造函数中。例如,

    std::vector < Pixel > pixels;//(dimension * dimension, Pixel());

    pixels.reserve(dimension * dimension);

    for (int i = 0; i < dimension * dimension; ++i) {

        pixels[i].r = 255;

        pixels[i].g = 0;

        pixels[i].b = 0;
    }

它具有与数组相同的性能。

下面是vector中的push_back方法的工作原理:

vector在初始化时分配X个空间。 如下所述,它检查当前底层数组中是否有空间用于该项。 它复制push_back调用中的项。

调用push_back X项后:

vector将kX的空间重新分配到第二个数组中。 它将第一个数组的项复制到第二个数组。 丢弃第一个数组。 现在使用第二个数组作为存储,直到它达到kX项。

重复。如果你没有预留空间,它肯定会变慢。更重要的是,如果复制项目的成本很高,那么像这样的“push_back”会让你生吞活剥。

至于向量和数组的区别,我同意其他人的观点。在发布版中运行,打开优化,并放入更多的标志,这样微软的友好人员就不会为你而烦恼了。

还有一件事,如果你不需要调整大小,使用Boost.Array。

好吧,因为vector::resize()比普通内存分配(由malloc)做更多的处理。

尝试在复制构造函数中设置断点(定义它以便可以设置断点!),就会增加处理时间。

这是一个古老而流行的问题。

在这一点上,许多程序员将使用c++ 11。在c++ 11中,OP的代码对于UseArray或UseVector运行得同样快。

UseVector completed in 3.74482 seconds
UseArray completed in 3.70414 seconds

基本的问题是,当你的像素结构未初始化时,std::vector<T>::resize(size_t, T const&=T())接受一个默认构造的像素并复制它。编译器没有注意到它被要求复制未初始化的数据,所以它实际执行了复制。

在c++ 11中,std::vector<T>::resize有两个重载。第一个是std::vector<T>::resize(size_t),另一个是std::vector<T>::resize(size_t, T const&)。这意味着当调用resize而不带第二个参数时,它只是默认构造,而编译器足够聪明,可以意识到默认构造什么也不做,因此它跳过了缓冲区的传递。

(添加这两个重载是为了处理可移动、可构造和不可复制类型——处理未初始化数据时的性能提升是一个额外的好处)。

push_back解决方案还执行fencepost检查,这降低了它的速度,因此它仍然比malloc版本慢。

现场示例(我还用chrono::high_resolution_clock替换了计时器)。

注意,如果你有一个通常需要初始化的结构,但你想在增加缓冲区后处理它,你可以使用自定义std::vector分配器来做到这一点。如果你想把它移动到一个更正常的std::vector,我相信仔细使用allocator_traits和重写==可能会成功,但我不确定。