哪个更快:堆栈分配还是堆分配

这个问题可能听起来相当简单，但这是我与另一位开发人员的争论。

我小心地在我可以的地方进行堆栈分配，而不是堆分配。他和我说话，看着我的肩膀，并评论说，这是没有必要的，因为他们的表现是一样的。

在我的印象中，增加堆栈是一个常数时间，而堆分配的性能取决于当前堆的复杂性，包括分配(找到合适大小的孔)和反分配(缩小孔以减少碎片，如果我没有弄错的话，许多标准库实现在删除过程中需要时间来做这件事)。

在我看来，这可能非常依赖于编译器。在这个项目中，我特别使用了一个用于PPC架构的Metrowerks编译器。对这种组合的深入了解将是最有帮助的，但一般来说，对于GCC和msvc++，情况如何?堆分配不如堆栈分配高效吗?没有区别吗?还是差异如此之小以至于变成了毫无意义的微观优化。

当前回答

老实说，写一个程序来比较性能是很简单的:

#include <ctime>
#include <iostream>

namespace {
    class empty { }; // even empty classes take up 1 byte of space, minimum
}

int main()
{
    std::clock_t start = std::clock();
    for (int i = 0; i < 100000; ++i)
        empty e;
    std::clock_t duration = std::clock() - start;
    std::cout << "stack allocation took " << duration << " clock ticks\n";
    start = std::clock();
    for (int i = 0; i < 100000; ++i) {
        empty* e = new empty;
        delete e;
    };
    duration = std::clock() - start;
    std::cout << "heap allocation took " << duration << " clock ticks\n";
}

有人说，愚蠢的一致性是小心眼的妖怪。显然，优化编译器是许多程序员心中的妖怪。这个讨论曾经在答案的底部，但人们显然不想读那么远，所以我把它移到这里，以避免遇到我已经回答过的问题。

优化编译器可能会注意到这段代码什么都不做，并可能将其全部优化。这样做是优化器的工作，与优化器斗争是徒劳的。

我建议在编译此代码时关闭优化，因为没有好方法可以欺骗当前正在使用或将来将使用的每个优化器。

任何打开优化器，然后抱怨与它斗争的人都应该受到公众的嘲笑。

如果我关心纳秒精度，我就不会使用std::clock()。如果我想把这些结果作为博士论文发表，我会对此做更大的研究，我可能会比较GCC、Tendra/Ten15、LLVM、Watcom、Borland、Visual c++、Digital Mars、ICC和其他编译器。实际上，堆分配所花费的时间是堆栈分配的数百倍，我认为进一步研究这个问题没有任何用处。

优化器的任务是去除我正在测试的代码。我不认为有任何理由告诉优化器运行，然后试图欺骗优化器不进行实际优化。但如果我看到这样做的价值，我会做以下一项或多项:

Add a data member to empty, and access that data member in the loop; but if I only ever read from the data member the optimizer can do constant folding and remove the loop; if I only ever write to the data member, the optimizer may skip all but the very last iteration of the loop. Additionally, the question wasn't "stack allocation and data access vs. heap allocation and data access." Declare e volatile, but volatile is often compiled incorrectly (PDF). Take the address of e inside the loop (and maybe assign it to a variable that is declared extern and defined in another file). But even in this case, the compiler may notice that -- on the stack at least -- e will always be allocated at the same memory address, and then do constant folding like in (1) above. I get all iterations of the loop, but the object is never actually allocated.

Beyond the obvious, this test is flawed in that it measures both allocation and deallocation, and the original question didn't ask about deallocation. Of course variables allocated on the stack are automatically deallocated at the end of their scope, so not calling delete would (1) skew the numbers (stack deallocation is included in the numbers about stack allocation, so it's only fair to measure heap deallocation) and (2) cause a pretty bad memory leak, unless we keep a reference to the new pointer and call delete after we've got our time measurement.

在我的机器上，在Windows上使用g++ 3.4.4，对于任何小于100000个分配的堆栈和堆分配，我都得到“0个时钟滴答”，即使这样，对于堆栈分配，我也得到“0个时钟滴答”，对于堆分配，我得到“15个时钟滴答”。当我测量10,000,000个分配时，堆栈分配需要31个时钟滴答，堆分配需要1562个时钟滴答。

是的，优化编译器可以省略创建空对象。如果我理解正确的话，它甚至可以省略整个第一个循环。当我将迭代次数增加到10,000,000次时，堆栈分配花费了31个时钟节拍，堆分配花费了1562个时钟节拍。我认为可以肯定地说，在没有告诉g++优化可执行文件的情况下，g++并没有省略构造函数。

在我写这篇文章之后的几年里，Stack Overflow的首选是发布优化构建的性能。总的来说，我认为这是正确的。然而，我仍然认为，当你实际上不希望代码被优化时，让编译器去优化代码是愚蠢的。在我看来，这很像给代客泊车额外付费，却拒绝交出钥匙。在这个特殊情况下，我不希望优化器运行。

使用稍微修改过的基准测试版本(以解决原始程序在每次循环时都没有在堆栈上分配一些东西的问题)，并在不进行优化的情况下编译，但链接到发布库(以解决我们不希望包括任何由于链接到调试库而导致的放缓的问题):

#include <cstdio>
#include <chrono>

namespace {
    void on_stack()
    {
        int i;
    }

    void on_heap()
    {
        int* i = new int;
        delete i;
    }
}

int main()
{
    auto begin = std::chrono::system_clock::now();
    for (int i = 0; i < 1000000000; ++i)
        on_stack();
    auto end = std::chrono::system_clock::now();

    std::printf("on_stack took %f seconds\n", std::chrono::duration<double>(end - begin).count());

    begin = std::chrono::system_clock::now();
    for (int i = 0; i < 1000000000; ++i)
        on_heap();
    end = std::chrono::system_clock::now();

    std::printf("on_heap took %f seconds\n", std::chrono::duration<double>(end - begin).count());
    return 0;
}

显示:

on_stack took 2.070003 seconds
on_heap took 57.980081 seconds

在我的系统上，当用命令行编译cl foo。cc /Od /MT /EHsc。

您可能不同意我获得非优化构建的方法。这很好:您可以随意修改基准测试。当我打开优化，我得到:

on_stack took 0.000000 seconds
on_heap took 51.608723 seconds

这并不是因为堆栈分配实际上是瞬时的，而是因为任何半像样的编译器都能注意到on_stack没有做任何有用的事情，可以进行优化。我的Linux笔记本电脑上的GCC也注意到on_heap没有做任何有用的事情，并优化了它:

on_stack took 0.000003 seconds
on_heap took 0.000002 seconds

2008-10-02 18:11:57

其他回答

关于Xbox 360 Xenon处理器上的堆栈与堆分配，我了解到一件有趣的事情，这可能也适用于其他多核系统，即在堆上分配会导致进入临界区以停止所有其他核，这样分配就不会发生冲突。因此，在一个紧密循环中，堆栈分配是固定大小数组的方法，因为它可以防止停顿。

如果您正在为多核/多进程编码，这可能是另一个需要考虑的加速，因为您的堆栈分配将只由运行您的作用域函数的核心可见，而不会影响任何其他内核/ cpu。

2009-03-02 01:55:50

自然，堆栈分配更快。使用堆分配，分配器必须在某处找到空闲内存。使用堆栈分配，编译器只需要给你的函数一个更大的堆栈框架就可以完成，这意味着分配完全不需要花费时间。(我假设您没有使用alloca或任何东西来动态分配堆栈空间，但即使这样，它也非常快。)

但是，您必须警惕隐藏的动态分配。例如:

void some_func()
{
    std::vector<int> my_vector(0x1000);
    // Do stuff with the vector...
}

您可能认为这会在堆栈上分配4 KiB，但您错了。它在堆栈上分配vector实例，但该vector实例又在堆上分配它的4 KiB，因为vector总是在堆上分配它的内部数组(至少除非您指定了一个自定义分配器，这里我不会深入讨论)。如果您希望使用类似stl的容器在堆栈上进行分配，则可能需要std::array或boost::static_vector(由外部boost库提供)。

2022-12-06 20:51:26

class Foo {
public:
    Foo(int a) {

    }
}
int func() {
    int a1, a2;
    std::cin >> a1;
    std::cin >> a2;

    Foo f1(a1);
    __asm push a1;
    __asm lea ecx, [this];
    __asm call Foo::Foo(int);

    Foo* f2 = new Foo(a2);
    __asm push sizeof(Foo);
    __asm call operator new;//there's a lot instruction here(depends on system)
    __asm push a2;
    __asm call Foo::Foo(int);

    delete f2;
}

It would be like this in asm. When you're in func, the f1 and pointer f2 has been allocated on stack (automated storage). And by the way, Foo f1(a1) has no instruction effects on stack pointer (esp),It has been allocated, if func wants get the member f1, it's instruction is something like this: lea ecx [ebp+f1], call Foo::SomeFunc(). Another thing the stack allocate may make someone think the memory is something like FIFO, the FIFO just happened when you go into some function, if you are in the function and allocate something like int i = 0, there no push happened.

2015-05-05 02:42:03

您可以为特定大小的对象编写一个非常高性能的特殊堆分配器。但是，一般的堆分配器性能不是特别好。

我也同意Torbjörn Gyllebring关于对象的预期生命期的观点。好点!

2008-10-02 06:08:22

前面提到过，堆栈分配只是移动堆栈指针，即大多数架构上的一条指令。将其与堆分配中通常发生的情况进行比较。

操作系统以链表的形式维护部分空闲内存，有效负载数据由指向空闲部分起始地址的指针和空闲部分的大小组成。为了分配X字节的内存，将遍历链表，并按顺序访问每个音符，检查其大小是否至少为X。当找到大小为P >= X的部分时，将P分成大小为X和P-X的两个部分。更新链表并返回指向第一部分的指针。

如您所见，堆分配取决于许多因素，如您请求的内存大小、内存的碎片程度等等。

2009-03-02 00:46:06

哪个更快:堆栈分配还是堆分配

推荐文章

最新文章

标签