





However, there are larger issues when dealing with the overall performance of stack vs. heap based allocation (or, in slightly better terms, local vs. external allocation). Usually, heap (external) allocation is slow because it has to deal with many different kinds of allocations and allocation patterns. Reducing the scope of the allocator you are using (making it local to the algorithm/code) will tend to increase performance without any major changes. Adding better structure to your allocation patterns, for example by forcing a LIFO ordering on allocation and deallocation pairs, can also improve your allocator's performance, because the allocator is then used in a simpler and more structured way. Or you can use or write an allocator tuned for your particular allocation pattern; most programs allocate a few discrete sizes frequently, so a heap that is based on a lookaside buffer of a few fixed (preferably known) sizes will perform extremely well. Windows uses its Low Fragmentation Heap for this very reason.
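As a rough illustration of the "few fixed sizes" idea, here is a minimal sketch of a fixed-size block pool. The class name `FixedPool` and its interface are hypothetical, invented for this example; it is single-threaded, never grows, and is not how any particular system heap is implemented. Allocation and deallocation reduce to popping and pushing a free list, so the most recently freed block is handed out first (LIFO).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>

// Hypothetical fixed-size block pool (illustration only, not thread-safe).
// Every block has the same size, so allocate/deallocate are a pop/push
// on an intrusive singly linked free list.
class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)),
          storage_(new unsigned char[block_size_ * block_count]) {
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < block_count; ++i) {
            push(storage_.get() + i * block_size_);
        }
    }

    // Returns nullptr when the pool is exhausted; a real allocator
    // might fall back to the general-purpose heap instead.
    void* allocate() {
        if (free_ == nullptr) return nullptr;
        Node* n = free_;
        free_ = n->next;
        return n;
    }

    // p must have come from allocate() on this pool.
    void deallocate(void* p) { push(p); }

private:
    struct Node { Node* next; };

    // Blocks must be able to hold a Node and stay suitably aligned.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        if (n < sizeof(Node)) n = sizeof(Node);
        return (n + a - 1) / a * a;
    }

    void push(void* p) { free_ = new (p) Node{free_}; }

    std::size_t block_size_;
    std::unique_ptr<unsigned char[]> storage_;
    Node* free_ = nullptr;
};
```

A caller would create one pool per frequently used size, for example `FixedPool pool(24, 1024);`, and pair every `pool.allocate()` with a `pool.deallocate(p)`.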





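On a 32-bit OS with multiple threads, the stack is often rather limited (albeit typically at least a few MB), because the address space needs to be carved up and sooner or later one thread's stack will run into another's. On single-threaded systems (single-threaded Linux glibc at least) the limitation is much smaller, because the stack can just keep growing.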


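Something interesting I learned about stack vs. heap allocation on the Xbox 360 Xenon processor, which may also apply to other multicore systems, is that allocating on the heap causes a critical section to be entered to halt all other cores so that the allocation doesn't conflict. Thus, in a tight loop, stack allocation was the way to go for fixed-size arrays, as it prevented stalls.

This may be another speedup to consider if you are coding for multicore/multiprocessor systems, in that your stack allocation will only be visible to the core running your scoped function, and it will not affect any other cores/CPUs.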




#include <ctime>
#include <iostream>

namespace {
    class empty { }; // even empty classes take up 1 byte of space, minimum
}

int main()
{
    // Time 100000 automatic (stack) objects.
    std::clock_t start = std::clock();
    for (int i = 0; i < 100000; ++i)
        empty e;
    std::clock_t duration = std::clock() - start;
    std::cout << "stack allocation took " << duration << " clock ticks\n";

    // Time 100000 new/delete pairs on the free store.
    start = std::clock();
    for (int i = 0; i < 100000; ++i) {
        empty* e = new empty;
        delete e;
    }
    duration = std::clock() - start;
    std::cout << "heap allocation took " << duration << " clock ticks\n";
}





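If I cared about nanosecond precision, I wouldn't use std::clock(). If I wanted to publish the results as a doctoral thesis, I would make a bigger deal of this, and I would probably compare GCC, Tendra/Ten15, LLVM, Watcom, Borland, Visual C++, Digital Mars, ICC and other compilers. As it is, heap allocation takes hundreds of times longer than stack allocation, and I don't see anything useful in investigating the question any further.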


1. Add a data member to empty, and access that data member in the loop; but if I only ever read from the data member the optimizer can do constant folding and remove the loop; if I only ever write to the data member, the optimizer may skip all but the very last iteration of the loop. Additionally, the question wasn't "stack allocation and data access vs. heap allocation and data access."
2. Declare e volatile, but volatile is often compiled incorrectly (PDF).
3. Take the address of e inside the loop (and maybe assign it to a variable that is declared extern and defined in another file). But even in this case, the compiler may notice that, on the stack at least, e will always be allocated at the same memory address, and then do constant folding like in (1) above. I get all iterations of the loop, but the object is never actually allocated.

Beyond the obvious, this test is flawed in that it measures both allocation and deallocation, and the original question didn't ask about deallocation. Of course variables allocated on the stack are automatically deallocated at the end of their scope, so not calling delete would (1) skew the numbers (stack deallocation is included in the numbers about stack allocation, so it's only fair to measure heap deallocation) and (2) cause a pretty bad memory leak, unless we keep a reference to the new pointer and call delete after we've got our time measurement.

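On my machine, using g++ 3.4.4 on Windows, I get "0 clock ticks" for both stack and heap allocation for anything less than 100000 allocations, and even then I get "0 clock ticks" for stack allocation and "15 clock ticks" for heap allocation. When I measure 10,000,000 allocations, stack allocation takes 31 clock ticks and heap allocation takes 1562 clock ticks.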


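In the years since I wrote this, the preference on Stack Overflow has been to post performance numbers from optimized builds. In general, I think this is correct. However, I still think it's silly to ask the compiler to optimize code when you in fact do not want that code optimized. It strikes me as being very similar to paying extra for valet parking but refusing to hand over the keys. In this particular case, I don't want the optimizer running.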


#include <cstdio>
#include <chrono>

namespace {
    // One automatic int per call.
    void on_stack()
    {
        int i;
    }

    // One new/delete pair per call.
    void on_heap()
    {
        int* i = new int;
        delete i;
    }
}

int main()
{
    auto begin = std::chrono::system_clock::now();
    for (int i = 0; i < 1000000000; ++i)
        on_stack();
    auto end = std::chrono::system_clock::now();

    std::printf("on_stack took %f seconds\n", std::chrono::duration<double>(end - begin).count());

    begin = std::chrono::system_clock::now();
    for (int i = 0; i < 1000000000; ++i)
        on_heap();
    end = std::chrono::system_clock::now();

    std::printf("on_heap took %f seconds\n", std::chrono::duration<double>(end - begin).count());
    return 0;
}


on_stack took 2.070003 seconds
on_heap took 57.980081 seconds

That is on my system, when compiled with the command line cl foo.cc /Od /MT /EHsc.


on_stack took 0.000000 seconds
on_heap took 51.608723 seconds


on_stack took 0.000003 seconds
on_heap took 0.000002 seconds



Although automatic objects still occupy memory according to the semantic rules of the abstract machine, a conforming C++ implementation is allowed to ignore this fact when it can prove that it does not matter (that is, when it does not change the observable behavior of the program). This permission is granted by the as-if rule in ISO C++, which is also the general clause enabling the usual optimizations (ISO C has an almost identical rule). Besides the as-if rule, ISO C++ also has copy elision rules that allow specific object creations to be omitted. The constructor and destructor calls involved are thereby omitted. As a result, the automatic objects (if any) in these constructors and destructors are also eliminated, compared to the naive abstract semantics implied by the source code.

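On the other hand, free store allocation is definitely "allocation" by design. Under ISO C++ rules, such an allocation is achieved by a call to an allocation function. However, since ISO C++14, there is a new (non-as-if) rule that allows calls to the global allocation functions (i.e. ::operator new) to be merged in specific cases. So parts of dynamic allocation operations can also be no-ops, as in the case of automatic objects.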




Because C++ does not expose reified activation records or any sort of first-class continuations (e.g. via the famous call/cc), there is no way to directly manipulate the activation record frames, which is where the implementation needs to place the automatic objects. As long as there is no (non-portable) interoperation with the underlying implementation ("native" non-portable code, such as inline assembly), omitting the underlying allocation of the frames can be quite trivial. For example, when the called function is inlined, its frame can effectively be merged into another one, so there is nothing left to point at as the "allocation".

However, once interop has to be respected, things get complex. A typical C++ implementation exposes interop at the ISA (instruction-set architecture) level, with some calling conventions as the binary boundary shared with native (ISA-level machine) code. This has an explicit cost, notably in maintaining the stack pointer, which is often held directly in an ISA-level register (probably with specific machine instructions to access it). The stack pointer indicates the boundary of the top frame of the (currently active) function call. When a function call is entered, a new frame is needed, and the stack pointer is incremented or decremented (depending on the convention of the ISA) by a value not less than the required frame size. The frame is then said to be allocated once the stack pointer has been adjusted. Function parameters may be passed on the stack frame as well, depending on the calling convention used for the call. The frame can hold the memory of the automatic objects (probably including the parameters) specified by the C++ source code. In the sense of such implementations, these objects are "allocated". When control exits the function call, the frame is no longer needed, and it is usually released by restoring the stack pointer to its state before the call (saved previously according to the calling convention). This can be viewed as "deallocation". These operations make the activation records effectively a LIFO data structure, so it is often called "the (call) stack". The stack pointer effectively indicates the top position of the stack.

Because most C++ implementations (particularly the ones targeting ISA-level native code and using assembly language as their immediate output) use strategies similar to this, such a confusing "allocation" scheme is popular. Such allocations (as well as deallocations) do spend machine cycles, and they can be expensive when (non-optimized) calls occur frequently, even though modern CPU microarchitectures can implement complex hardware optimizations for the common code pattern (like using a stack engine to implement PUSH/POP instructions).

But anyway, in general, it is true that the cost of stack frame allocation is significantly less than that of a call to an allocation function operating on the free store (unless the call is totally optimized away), which itself can take hundreds of (if not millions of :-) operations, including those that maintain the stack pointer and other state. Allocation functions are typically based on an API provided by the hosted environment (e.g. the runtime provided by the OS). Unlike frames, whose purpose is to hold automatic objects for function calls, such allocations are general-purpose, so they do not have a frame structure like a stack. Traditionally, they allocate space from a pool of storage called the heap (or several heaps). Unlike "stack", the term "heap" here does not indicate the data structure being used; it is derived from early language implementations decades ago. (BTW, the call stack is usually allocated from the heap by the environment at program/thread startup, with a fixed or user-specified size.) The nature of the use cases makes allocations and deallocations from a heap far more complicated (than pushing/popping stack frames), and hardly possible to optimize directly in hardware.


The usual stack allocation always puts the new frame on the top, so it has quite good locality. This is friendly to the cache. OTOH, memory allocated randomly in the free store has no such property. Since ISO C++17, there are pool resource classes provided by <memory_resource>. The direct purpose of such an interface is to allow the results of consecutive allocations to be close together in memory. This acknowledges the fact that this strategy is generally good for performance with contemporary implementations, e.g. being friendly to the cache on modern architectures. This is about the performance of access rather than allocation, though.
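For what it's worth, here is a small sketch of how the C++17 pool resources can be used. The function name `sum_list_from_pool` is made up for this example, and it assumes a standard library that ships <memory_resource> and a compiler running in C++17 mode or later. The list nodes are taken from a `std::pmr::unsynchronized_pool_resource` instead of from individual calls to the global allocation function.

```cpp
#include <cassert>
#include <list>
#include <memory_resource>
#include <numeric>

// Hypothetical example: builds a list of 1..n whose nodes come from a
// pool resource, then sums it. Assumes C++17 <memory_resource> support.
int sum_list_from_pool(int n) {
    // Single-threaded pool; its upstream is the default memory resource.
    std::pmr::unsynchronized_pool_resource pool;

    // Every node allocation of this list goes through the pool.
    std::pmr::list<int> items(&pool);
    for (int i = 1; i <= n; ++i) {
        items.push_back(i);
    }

    // The list is destroyed before the pool, which then releases its
    // memory back to the upstream resource in one go.
    return std::accumulate(items.begin(), items.end(), 0);
}
```

`std::pmr::synchronized_pool_resource` offers the same interface when the resource has to be shared between threads, at the price of internal synchronization.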


The expectation of concurrent access to memory can affect the stack and heaps differently. A call stack is usually exclusively owned by one thread of execution in a typical C++ implementation. OTOH, heaps are often shared among the threads in a process. For such heaps, the allocation and deallocation functions have to protect the shared internal administrative data structures from data races. As a result, heap allocations and deallocations may have additional overhead due to internal synchronization operations.





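First, there is no portable way in ISO C++ to allocate space on the stack with a size specified at runtime. Implementations provide extensions such as alloca and G++'s VLAs (variable-length arrays), but there are reasons to avoid them. (IIRC, the Linux source recently removed its uses of VLAs.) (Also note that ISO C99 does mandate VLAs, but ISO C11 makes the support optional.)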
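One portable workaround, sketched below, is to keep a fixed-size automatic buffer for the common small case and fall back to the free store only when the runtime size exceeds it. The function `sum_of_squares` and the threshold `kInline` are hypothetical, chosen only for illustration; the right threshold depends on how much stack space you can afford.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: needs n elements of scratch space, where n is
// only known at runtime. Small requests use an automatic buffer; large
// ones fall back to the free store.
long long sum_of_squares(std::size_t n) {
    constexpr std::size_t kInline = 256;       // assumed threshold
    std::array<long long, kInline> inline_buf; // automatic storage
    std::vector<long long> heap_buf;           // stays empty if unused

    long long* scratch = inline_buf.data();
    if (n > kInline) {
        heap_buf.resize(n);                    // the only heap allocation
        scratch = heap_buf.data();
    }

    // Fill the scratch space, then consume it.
    for (std::size_t i = 0; i < n; ++i) {
        const long long v = static_cast<long long>(i) + 1;
        scratch[i] = v * v;
    }
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += scratch[i];
    }
    return total;
}
```

Unlike alloca or a VLA, the stack usage here is bounded at compile time, so a large n cannot overrun the stack.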

Second, there is no reliable and portable way to detect stack space exhaustion. This is often called stack overflow (hmm, the etymology of this site), but probably more accurately, stack overrun. In reality, this often causes invalid memory access, and the state of the program is then corrupted (... or maybe worse, a security hole). In fact, ISO C++ has no concept of "the stack" and makes it undefined behavior when the resource is exhausted. Be cautious about how much room should be left for automatic objects.


Nevertheless, deep recursive calls are sometimes desired. In implementations of languages that require support for unbounded active calls (where the call depth is limited only by total memory), it is impossible to use the (contemporary) native call stack directly as the target language's activation record, as typical C++ implementations do. To work around the problem, alternative ways of constructing activation records are needed. For example, SML/NJ explicitly allocates frames on the heap and uses cactus stacks. The complicated allocation of such activation record frames is usually not as fast as that of call stack frames. However, if such languages are further implemented with a guarantee of proper tail recursion, direct stack allocation in the object language (that is, an "object" in the language is not stored as a reference, but as a native primitive value that can be mapped one-to-one to an unshared C++ object) is even more complicated, with a larger performance penalty in general. When using C++ to implement such languages, it is difficult to estimate the performance impact.