我一直认为std::vector是“作为数组实现的”,等等等等。今天我去测试了一下,结果似乎不是这样:

以下是一些测试结果:

UseArray completed in 2.619 seconds
UseVector completed in 9.284 seconds
UseVectorPushBack completed in 14.669 seconds
The whole thing completed in 26.591 seconds

这大约要慢3 - 4倍!这并不能证明“向量可能会慢几纳秒”的评论是正确的。

我使用的代码是:

#include <cstdlib>
#include <vector>

#include <iostream>
#include <string>

#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
    public:
        TestTimer(const std::string & name) : name(name),
            start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
        {
        }

        ~TestTimer()
        {
            using namespace std;
            using namespace boost;

            posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
            posix_time::time_duration d = now - start;

            cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
                " seconds" << endl;
        }

    private:
        std::string name;
        boost::posix_time::ptime start;
};

struct Pixel
{
    Pixel()
    {
    }

    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
    {
    }

    unsigned char r, g, b;
};

void UseVector()
{
    TestTimer t("UseVector");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.resize(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

void UseVectorPushBack()
{
    TestTimer t("UseVectorPushBack");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
            pixels.reserve(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}

void UseArray()
{
    TestTimer t("UseArray");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);

        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }

        free(pixels);
    }
}

int main()
{
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();
    UseVectorPushBack();

    return 0;
}

我做错了吗?还是我刚刚打破了这个性能神话?

我使用Visual Studio 2005中的发布模式。


在Visual c++中,#define _SECURE_SCL 0将UseVector减少了一半(减少到4秒)。在我看来,这真的是件大事。


当前回答

公平地说,您不能将c++实现与C实现进行比较,即我所说的malloc版本。Malloc不创建对象——它只分配原始内存。然后不调用构造函数就把内存当作对象,这是拙劣的c++(可能是无效的——我把这个问题留给语言律师吧)。

也就是说,简单地将malloc更改为新的Pixel[维度*维度]并自由删除[]个像素,这与您所拥有的Pixel的简单实现没有太大区别。下面是我的盒子(E6600, 64位)上的结果:

UseArray completed in 0.269 seconds
UseVector completed in 1.665 seconds
UseVectorPushBack completed in 7.309 seconds
The whole thing completed in 9.244 seconds

但随着一个微小的变化,情况发生了变化:

Pixel.h

struct Pixel
{
    Pixel();
    Pixel(unsigned char r, unsigned char g, unsigned char b);

    unsigned char r, g, b;
};

Pixel.cc

#include "Pixel.h"

Pixel::Pixel() {}
Pixel::Pixel(unsigned char r, unsigned char g, unsigned char b) 
  : r(r), g(g), b(b) {}

main.cc

#include "Pixel.h"
[rest of test harness without class Pixel]
[UseArray now uses new/delete not malloc/free]

编译如下:

$ g++ -O3 -c -o Pixel.o Pixel.cc
$ g++ -O3 -c -o main.o main.cc
$ g++ -o main main.o Pixel.o

我们得到了非常不同的结果:

UseArray completed in 2.78 seconds
UseVector completed in 1.651 seconds
UseVectorPushBack completed in 7.826 seconds
The whole thing completed in 12.258 seconds

使用Pixel的非内联构造函数,std::vector现在可以击败原始数组。

通过std::vector和std:allocator进行分配的复杂性似乎太多了,无法像简单的新Pixel[n]那样有效地进行优化。然而,我们可以看到问题仅仅是分配而不是vector访问,通过调整几个测试函数来创建vector/数组,将其移到循环之外:

void UseVector()
{
    TestTimer t("UseVector");

    int dimension = 999;
    std::vector<Pixel> pixels;
    pixels.resize(dimension * dimension);

    for(int i = 0; i < 1000; ++i)
    {
        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

and

void UseArray()
{
    TestTimer t("UseArray");

    int dimension = 999;
    Pixel * pixels = new Pixel[dimension * dimension];

    for(int i = 0; i < 1000; ++i)
    {
        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
    delete [] pixels;
}

我们现在得到这些结果:

UseArray completed in 0.254 seconds
UseVector completed in 0.249 seconds
UseVectorPushBack completed in 7.298 seconds
The whole thing completed in 7.802 seconds

从这里我们可以了解到std::vector可以与原始数组进行访问,但是如果你需要多次创建和删除vector/数组,在元素的构造函数没有内联的情况下,创建一个复杂的对象将比创建一个简单的数组花费更多的时间。我不认为这很令人惊讶。

其他回答

试试这个:

void UseVectorCtor()
{
    TestTimer t("UseConstructor");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
    }
}

我得到了和数组几乎完全一样的性能。

The thing about vector is that it's a much more general tool than an array. And that means you have to consider how you use it. It can be used in a lot of different ways, providing functionality that an array doesn't even have. And if you use it "wrong" for your purpose, you incur a lot of overhead, but if you use it correctly, it is usually basically a zero-overhead data structure. In this case, the problem is that you separately initialized the vector (causing all elements to have their default ctor called), and then overwriting each element individually with the correct value. That is much harder for the compiler to optimize away than when you do the same thing with an array. Which is why the vector provides a constructor which lets you do exactly that: initialize N elements with value X.

当你使用它时,向量和数组一样快。

所以,你还没有打破性能神话。但是你已经证明了只有当你最优地使用向量时它才成立,这也是一个很好的观点。:)

好的一面是,它确实是最简单的用法,但却是最快的。如果您将我的代码片段(一行)与John Kugelman的答案进行对比,其中包含大量的调整和优化,但仍然不能完全消除性能差异,很明显,vector的设计非常巧妙。你不必费尽周折才能得到等于数组的速度。相反,您必须使用最简单的解决方案。

当我第一次看您的代码时,这很难说是一个公平的比较;我还以为你不是在比较苹果和苹果。所以我想,让构造函数和析构函数在所有测试中都被调用;然后比较。

const size_t dimension = 1000;

void UseArray() {
    TestTimer t("UseArray");
    for(size_t j = 0; j < dimension; ++j) {
        Pixel* pixels = new Pixel[dimension * dimension];
        for(size_t i = 0 ; i < dimension * dimension; ++i) {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = (unsigned char) (i % 255);
        }
        delete[] pixels;
    }
}

void UseVector() {
    TestTimer t("UseVector");
    for(size_t j = 0; j < dimension; ++j) {
        std::vector<Pixel> pixels(dimension * dimension);
        for(size_t i = 0; i < dimension * dimension; ++i) {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = (unsigned char) (i % 255);
        }
    }
}

int main() {
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();

    return 0;
}

我的想法是,在这样的设置下,它们应该是完全相同的。事实证明,我错了。

UseArray completed in 3.06 seconds
UseVector completed in 4.087 seconds
The whole thing completed in 10.14 seconds

那么为什么会出现30%的性能损失呢?STL的所有内容都在头文件中,因此编译器应该能够理解所需的所有内容。

我的想法是,它是在循环如何初始化默认构造函数的所有值。所以我做了一个测试:

class Tester {
public:
    static int count;
    static int count2;
    Tester() { count++; }
    Tester(const Tester&) { count2++; }
};
int Tester::count = 0;
int Tester::count2 = 0;

int main() {
    std::vector<Tester> myvec(300);
    printf("Default Constructed: %i\nCopy Constructed: %i\n", Tester::count, Tester::count2);

    return 0;
}

结果如我所料:

Default Constructed: 1
Copy Constructed: 300

这显然是减速的根源,因为vector使用复制构造函数从默认构造的对象初始化元素。

这意味着,以下伪操作顺序发生在向量的构造过程中:

Pixel pixel;
for (auto i = 0; i < N; ++i) vector[i] = pixel;

由于编译器创建了隐式复制构造函数,扩展为:

Pixel pixel;
for (auto i = 0; i < N; ++i) {
    vector[i].r = pixel.r;
    vector[i].g = pixel.g;
    vector[i].b = pixel.b;
}

所以默认像素保持未初始化,而其余的初始化使用默认像素的未初始化值。

与New[]/Delete[]相比:

int main() {
    Tester* myvec = new Tester[300];

    printf("Default Constructed: %i\nCopy Constructed:%i\n", Tester::count, Tester::count2);

    delete[] myvec;

    return 0;
}

Default Constructed: 300
Copy Constructed: 0

它们都保留了未初始化的值,并且没有对序列进行两次迭代。

有了这些信息,我们如何进行测试呢?让我们试着重写隐式复制构造函数。

Pixel(const Pixel&) {}

结果呢?

UseArray completed in 2.617 seconds
UseVector completed in 2.682 seconds
The whole thing completed in 5.301 seconds

总而言之,如果你经常制作数百个向量:重新考虑你的算法。

在任何情况下,STL实现都不会因为某些未知的原因而变慢,它只是完全按照你的要求去做;希望你能明白。

我做了一些长期以来一直想做的广泛测试。不妨分享一下。

这是我的双启动机i7-3770, 16GB Ram, x86_64, Windows 8.1和Ubuntu 16.04。更多信息和结论,备注如下。测试了MSVS 2017和g++(在Windows和Linux上)。

测试程序

#include <iostream>
#include <chrono>
//#include <algorithm>
#include <array>
#include <locale>
#include <vector>
#include <queue>
#include <deque>

// Note: total size of array must not exceed 0x7fffffff B = 2,147,483,647B
//  which means that largest int array size is 536,870,911
// Also image size cannot be larger than 80,000,000B
constexpr int long g_size = 100000;
int g_A[g_size];


int main()
{
    std::locale loc("");
    std::cout.imbue(loc);
    constexpr int long size = 100000;  // largest array stack size

    // stack allocated c array
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    int A[size];
    for (int i = 0; i < size; i++)
        A[i] = i;

    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "c-style stack array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "c-style stack array size=" << sizeof(A) << "B\n\n";

    // global stack c array
    start = std::chrono::steady_clock::now();
    for (int i = 0; i < g_size; i++)
        g_A[i] = i;

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "global c-style stack array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "global c-style stack array size=" << sizeof(g_A) << "B\n\n";

    // raw c array heap array
    start = std::chrono::steady_clock::now();
    int* AA = new int[size];    // bad_alloc() if it goes higher than 1,000,000,000
    for (int i = 0; i < size; i++)
        AA[i] = i;

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "c-style heap array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "c-style heap array size=" << sizeof(AA) << "B\n\n";
    delete[] AA;

    // std::array<>
    start = std::chrono::steady_clock::now();
    std::array<int, size> AAA;
    for (int i = 0; i < size; i++)
        AAA[i] = i;
    //std::sort(AAA.begin(), AAA.end());

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::array size=" << sizeof(AAA) << "B\n\n";

    // std::vector<>
    start = std::chrono::steady_clock::now();
    std::vector<int> v;
    for (int i = 0; i < size; i++)
        v.push_back(i);
    //std::sort(v.begin(), v.end());

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::vector duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::vector size=" << v.size() * sizeof(v.back()) << "B\n\n";

    // std::deque<>
    start = std::chrono::steady_clock::now();
    std::deque<int> dq;
    for (int i = 0; i < size; i++)
        dq.push_back(i);
    //std::sort(dq.begin(), dq.end());

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::deque duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::deque size=" << dq.size() * sizeof(dq.back()) << "B\n\n";

    // std::queue<>
    start = std::chrono::steady_clock::now();
    std::queue<int> q;
    for (int i = 0; i < size; i++)
        q.push(i);

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::queue duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::queue size=" << q.size() * sizeof(q.front()) << "B\n\n";
}

结果

//////////////////////////////////////////////////////////////////////////////////////////
// with MSVS 2017:
// >> cl /std:c++14 /Wall -O2 array_bench.cpp
//
// c-style stack array duration=0.15ms
// c-style stack array size=400,000B
//
// global c-style stack array duration=0.130ms
// global c-style stack array size=400,000B
//
// c-style heap array duration=0.90ms
// c-style heap array size=4B
//
// std::array duration=0.20ms
// std::array size=400,000B
//
// std::vector duration=0.544ms
// std::vector size=400,000B
//
// std::deque duration=1.375ms
// std::deque size=400,000B
//
// std::queue duration=1.491ms
// std::queue size=400,000B
//
//////////////////////////////////////////////////////////////////////////////////////////
//
// with g++ version:
//      - (tdm64-1) 5.1.0 on Windows
//      - (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609 on Ubuntu 16.04
// >> g++ -std=c++14 -Wall -march=native -O2 array_bench.cpp -o array_bench
//
// c-style stack array duration=0ms
// c-style stack array size=400,000B
//
// global c-style stack array duration=0.124ms
// global c-style stack array size=400,000B
//
// c-style heap array duration=0.648ms
// c-style heap array size=8B
//
// std::array duration=1ms
// std::array size=400,000B
//
// std::vector duration=0.402ms
// std::vector size=400,000B
//
// std::deque duration=0.234ms
// std::deque size=400,000B
//
// std::queue duration=0.304ms
// std::queue size=400,000
//
//////////////////////////////////////////////////////////////////////////////////////////

笔记

平均10次组装。 我最初也使用std::sort()执行测试(您可以看到它被注释掉了),但后来删除了它们,因为没有显著的相对差异。

我的结论和评论

notice how global c-style array takes almost as much time as the heap c-style array Out of all tests I noticed a remarkable stability in std::array's time variations between consecutive runs, while others especially std:: data structs varied wildly in comparison O3 optimization didn't show any noteworthy time differences Removing optimization on Windows cl (no -O2) and on g++ (Win/Linux no -O2, no -march=native) increases times SIGNIFICANTLY. Particularly for std::data structs. Overall higher times on MSVS than g++, but std::array and c-style arrays faster on Windows without optimization g++ produces faster code than microsoft's compiler (apparently it runs faster even on Windows).

判决

当然,这是用于优化构建的代码。既然问题是关于std::vector,那么是的,它是!比普通数组(优化/未优化)慢。但是当您进行基准测试时,您自然希望生成优化的代码。

对我来说,这个节目的明星是std::array。

Martin York的回答让我很困扰,因为他似乎试图掩盖初始化问题。但他将冗余的默认构造确定为性能问题的根源是正确的。

[编辑:Martin的回答不再建议更改默认构造函数。]

对于眼前的问题,你当然可以调用2参数版本的向量<Pixel> ctor:

std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));

如果你想用一个常数值初始化,这是一种常见的情况。但更普遍的问题是:如何有效地初始化比常数值更复杂的东西?

为此,您可以使用back_insert_iterator,这是一个迭代器适配器。这里有一个int类型的向量的例子,尽管一般的思想也适用于像素:

#include <iterator>
// Simple functor return a list of squares: 1, 4, 9, 16...
struct squares {
    squares() { i = 0; }
    int operator()() const { ++i; return i * i; }

private:
    int i;
};

...

std::vector<int> v;
v.reserve(someSize);     // To make insertions efficient
std::generate_n(std::back_inserter(v), someSize, squares());

或者,您可以使用copy()或transform()来代替generate_n()。

缺点是,构造初始值的逻辑需要移动到一个单独的类中,这比将其放在原位更不方便(尽管c++ 1x中的lambdas使这更好)。此外,我希望这仍然不会像基于malloc()的非stl版本那样快,但我希望它会接近,因为它只对每个元素进行一次构造。

使用正确的选项,向量和数组可以生成相同的asm。在这些情况下,它们的速度当然是一样的,因为无论哪种方式都可以得到相同的可执行文件。