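Which is faster: stack allocation or heap allocation

This question may sound fairly elementary, but it is a debate I had with another developer I work with.

I was taking care to stack-allocate things where I could instead of heap-allocating them. He was talking to me and watching over my shoulder, and commented that it wasn't necessary because the two perform the same.

I was always under the impression that growing the stack is constant time, while the performance of heap allocation depends on the current complexity of the heap, both for allocation (finding a hole of the proper size) and for deallocation (collapsing holes to reduce fragmentation; if I am not mistaken, many standard library implementations take time to do this during deletes).

This strikes me as something that is probably very compiler dependent. For this project in particular I am using a Metrowerks compiler for the PPC architecture. Insight into this combination would be most helpful, but in general, for GCC and MSVC++, what is the case? Is heap allocation not as fast as stack allocation? Is there no difference? Or are the differences so minute that this becomes pointless micro-optimization?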

Current answer

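I think the lifetime is crucial, as is whether the thing being allocated has to be constructed in a complex way. For example, in transaction-driven modeling, you usually have to fill in a transaction structure with a bunch of fields and pass it to operation functions. Take the OSCI SystemC TLM-2.0 standard as an example.

Allocating these on the stack close to the operation call tends to cause enormous overhead, because the construction is expensive. The better approach is to allocate the transaction objects on the heap and reuse them, either through a pool or through a simple policy such as "this module only ever needs one transaction object".

This is many times faster than allocating the object on each operation call.

The reason is simply that the object is expensive to construct and has a fairly long useful lifetime.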
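A minimal sketch of the "one transaction object per module" policy, in C++. The `Transaction` and `Initiator` types here are hypothetical and are not taken from TLM-2.0; the payload size and the construction counter exist only to illustrate that the expensive constructor runs once.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical transaction type: construction is deliberately expensive
// because it sets up a payload buffer.
struct Transaction {
    std::uint64_t address = 0;
    std::vector<std::uint8_t> data;

    // Counts constructor calls (for illustration only).
    static int& constructions() {
        static int count = 0;
        return count;
    }

    Transaction() : data(4096) { ++constructions(); }

    // Cheap re-initialisation for reuse; the payload buffer is kept.
    void reset(std::uint64_t addr) { address = addr; }
};

// Hypothetical module that owns one long-lived, heap-allocated transaction
// and reuses it for every operation call.
class Initiator {
public:
    Initiator() : txn_(std::make_unique<Transaction>()) {}

    std::uint64_t send(std::uint64_t addr) {
        txn_->reset(addr);
        return txn_->address;  // stand-in for the real operation call
    }

private:
    std::unique_ptr<Transaction> txn_;
};
```

Calling `send` a thousand times constructs `Transaction` once, whereas declaring a `Transaction` inside `send` would construct it on every call.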

I would say: try both and see which works best in your case, because it really depends on the behavior of your code.

2008-10-02 06:43:14

Other answers

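Probably the biggest problem with heap allocation versus stack allocation is that heap allocation is, in the general case, an unbounded operation, so you cannot use it where timing is an issue.

For other applications where timing is not an issue, it may not matter as much, but if you heap-allocate a lot, it will affect execution speed. Always try to use the stack for short-lived and frequently allocated memory (for instance in loops), and as far as possible do heap allocation during application startup.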
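As a rough illustration of that advice, the hypothetical `Worker` class below reserves its large buffer once at construction (standing in for application startup) and uses an automatic `std::array` for per-iteration scratch space. The class name, sizes, and arithmetic are assumptions made up for this sketch.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical worker: the result buffer is heap-allocated once up front,
// and the per-iteration scratch space is an automatic (stack) object.
class Worker {
public:
    explicit Worker(std::size_t capacity) { results_.reserve(capacity); }

    void run(std::size_t iterations) {
        for (std::size_t i = 0; i < iterations; ++i) {
            std::array<int, 16> scratch{};  // automatic storage, no allocator call
            for (std::size_t j = 0; j < scratch.size(); ++j)
                scratch[j] = static_cast<int>(i + j);
            // Only store while reserved capacity remains, so the loop
            // never triggers a reallocation.
            if (results_.size() < results_.capacity())
                results_.push_back(scratch[0] + scratch[15]);
        }
    }

    const std::vector<int>& results() const { return results_; }

private:
    std::vector<int> results_;
};
```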

2008-10-02 08:34:12

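In general, stack allocation is faster than heap allocation, as almost every answer above mentions. A stack push or pop is O(1), whereas allocating or freeing from a heap may require walking previous allocations. However, you usually shouldn't be allocating in tight, performance-intensive loops anyway, so the choice usually comes down to other factors.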
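If you want a first rough number for your own compiler, a timing sketch along these lines may help. The helper names (`time_ns`, `time_stack`, `time_heap`) and the `sink` variable are made up for this example, and the results depend heavily on optimization flags and the allocator in use, so treat it as a sanity check rather than a benchmark.

```cpp
#include <chrono>
#include <cstddef>
#include <memory>

// Hypothetical helper: runs `body` `iterations` times and returns the
// elapsed wall-clock time in nanoseconds.
template <typename F>
long long time_ns(std::size_t iterations, F body) {
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) body(i);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Volatile sink so the optimizer cannot discard the stores entirely.
volatile int sink = 0;

// Uses a small automatic array on every iteration.
long long time_stack(std::size_t iterations) {
    return time_ns(iterations, [](std::size_t i) {
        int local[16];
        local[i % 16] = static_cast<int>(i);
        sink = local[i % 16];
    });
}

// Allocates and frees a same-sized array from the free store on every iteration.
long long time_heap(std::size_t iterations) {
    return time_ns(iterations, [](std::size_t i) {
        std::unique_ptr<int[]> block(new int[16]);
        block[i % 16] = static_cast<int>(i);
        sink = block[i % 16];
    });
}
```

Compare `time_stack(n)` with `time_heap(n)` for a large `n`, both with and without optimization, and repeat the runs before drawing conclusions.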

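It might be good to make this distinction: you can use a "stack allocator" on the heap. Strictly speaking, I take stack allocation to mean the actual method of allocation rather than the location of the allocation. If you are allocating a lot of stuff on the actual program stack, that could be bad for a variety of reasons. On the other hand, using a stack method to allocate on the heap when possible is the best choice you can make for an allocation method.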
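One way to picture a "stack allocator" on the heap is the sketch below: a single heap block with a top offset that is bumped on allocation and rolled back to a saved marker on release. The `StackAllocator` class and its interface are hypothetical, and the sketch assumes requested alignments do not exceed `alignof(std::max_align_t)`.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical LIFO ("stack") allocator whose backing block lives on the heap.
class StackAllocator {
public:
    explicit StackAllocator(std::size_t capacity)
        : buffer_(new unsigned char[capacity]), capacity_(capacity), top_(0) {}

    // Bumps the top offset; returns nullptr when the block is exhausted.
    // Assumes `align` is no larger than alignof(std::max_align_t).
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        std::size_t start = (top_ + align - 1) / align * align;
        if (start > capacity_ || size > capacity_ - start) return nullptr;
        top_ = start + size;
        return buffer_.get() + start;
    }

    // Remembers the current top so it can be restored later.
    std::size_t marker() const { return top_; }

    // Frees everything allocated after the marker in one step.
    void release_to(std::size_t m) { top_ = m; }

private:
    std::unique_ptr<unsigned char[]> buffer_;
    std::size_t capacity_;
    std::size_t top_;
};
```

Both allocation and release are a few arithmetic operations, and no fragmentation can occur, at the price of requiring strictly LIFO release order.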

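Since you mentioned Metrowerks and PPC, I'm guessing you mean the Wii. In that case memory is at a premium, and using a stack allocation method wherever possible guarantees that you don't waste memory on fragments. Of course, doing this requires a lot more care than "normal" heap allocation methods. It is wise to evaluate the tradeoffs for each situation.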

2009-03-02 01:36:43

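Concerns Specific to the C++ Language

First of all, C++ mandates no so-called "stack" or "heap" allocation. If you are talking about automatic objects in block scopes, they are not even "allocated". (By the way, automatic storage duration in C is definitely not the same as "allocated"; the latter is "dynamic" in C++ parlance.) Dynamically allocated memory is on the free store, not necessarily on "the heap", though the latter is often the (default) implementation.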

Although, per the abstract machine semantics, automatic objects still occupy memory, a conforming C++ implementation is allowed to ignore this fact when it can prove that it does not matter (that is, when it does not change the observable behavior of the program). This permission is granted by the as-if rule in ISO C++, which is also the general clause enabling the usual optimizations (ISO C has a nearly identical rule). Besides the as-if rule, ISO C++ also has copy elision rules that allow specific object creations to be omitted. The constructor and destructor calls involved are thereby omitted. As a result, any automatic objects in these constructors and destructors are also eliminated, compared to the naive abstract semantics implied by the source code.

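On the other hand, free store allocation is definitely "allocation" by design. Under ISO C++ rules, such an allocation is achieved by a call to an allocation function. However, since ISO C++14, there is a new (non-as-if) rule that allows calls to the global allocation functions (i.e. ::operator new) to be merged in specific cases. So some dynamic allocation operations can also be no-ops, as in the case of automatic objects.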
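The following pair of functions is a small illustration of code where a compiler is permitted, though never required, to drop the allocation. The function names are invented for this sketch, and whether a given compiler actually removes the `new`/`delete` pair depends on the implementation and the optimization level.

```cpp
// Allocates an int on the free store, uses it, and frees it again.
// Under the C++14 rule a compiler may omit the allocation function call,
// but nothing obliges it to.
int sum_via_new(int a, int b) {
    int* p = new int(a);
    int result = *p + b;
    delete p;
    return result;
}

// The same computation with an automatic object, for comparison.
int sum_via_automatic(int a, int b) {
    int local = a;
    return local + b;
}
```

Inspecting the generated assembly for both functions at a high optimization level is one way to see what your compiler does with them.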

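Allocation functions allocate memory resources. Objects can then be allocated on top of that memory using allocators. Automatic objects, by contrast, are directly presented: although the underlying memory can be accessed and used to provide memory for other objects (via placement new), this does not make much sense as a free store, because there is no way to move the resources elsewhere.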
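For readers unfamiliar with placement new, here is a minimal sketch of constructing an object inside the memory of an automatic buffer. The `Sample` type and `use_in_place` function are hypothetical names chosen for the example.

```cpp
#include <new>

// Hypothetical type constructed inside an automatic buffer.
struct Sample {
    int value;
    explicit Sample(int v) : value(v) {}
};

int use_in_place(int v) {
    // Automatic storage, suitably sized and aligned for one Sample.
    alignas(Sample) unsigned char storage[sizeof(Sample)];

    // Placement new constructs the object in the buffer;
    // no allocation function obtains new memory here.
    Sample* s = new (storage) Sample(v);
    int result = s->value * 2;

    // The object must be destroyed explicitly; the buffer itself
    // disappears together with the enclosing frame.
    s->~Sample();
    return result;
}
```

The storage cannot outlive the function call, which is the sense in which it makes a poor substitute for the free store.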

All other concerns are outside the scope of C++. Nevertheless, they can still be significant.

About Implementations of C++

C++ does not expose reified activation records or any sort of first-class continuations (e.g. via the famous call/cc), so there is no way to directly manipulate the activation record frames, which is where the implementation needs to place the automatic objects. As long as there is no (non-portable) interoperation with the underlying implementation ("native" non-portable code, such as inline assembly), omitting the underlying allocation of the frames can be quite trivial. For example, when the called function is inlined, its frame can be effectively merged into the caller's, so there is nothing left to point to as the "allocation".

However, once interop is taken into account, things get complex. A typical C++ implementation exposes interop at the ISA (instruction-set architecture) level, with calling conventions forming the binary boundary shared with native (ISA-level machine) code. This has an explicit cost, notably in maintaining the stack pointer, which is often held directly in an ISA-level register (probably with specific machine instructions to access it). The stack pointer indicates the boundary of the top frame of the (currently active) function call. When a function call is entered, a new frame is needed, and the stack pointer is increased or decreased (depending on the convention of the ISA) by a value not less than the required frame size. The frame is then said to be allocated once the stack pointer has been adjusted. Parameters of functions may be passed on the stack frame as well, depending on the calling convention used for the call. The frame can hold the memory of the automatic objects (probably including the parameters) specified by the C++ source code. In the sense of such implementations, these objects are "allocated". When control exits the function call, the frame is no longer needed, and it is usually released by restoring the stack pointer to its state before the call (saved previously according to the calling convention). This can be viewed as "deallocation". These operations make the activation records effectively a LIFO data structure, so it is often called "the (call) stack". The stack pointer effectively indicates the top position of the stack.

Because most C++ implementations (particularly the ones targeting ISA-level native code and using assembly language as their immediate output) use strategies similar to this, such a confusing "allocation" scheme is popular. Such allocations (as well as deallocations) do spend machine cycles, and they can be expensive when (non-optimized) calls occur frequently, even though modern CPU microarchitectures can implement complex hardware optimizations for the common code pattern (like using a stack engine to implement PUSH/POP instructions).

But anyway, in general, it is true that the cost of stack frame allocation is significantly less than that of a call to an allocation function operating on the free store (unless the call is totally optimized away), which itself can take hundreds of (if not millions of :-) operations, including maintaining the stack pointer and other state. Allocation functions are typically based on an API provided by the hosted environment (e.g. a runtime provided by the OS). Unlike the storage that holds automatic objects for function calls, such allocations are general-purpose, so they do not have a frame structure like a stack. Traditionally, they allocate space from a pool of storage called the heap (or several heaps). Unlike "stack", the concept "heap" here does not indicate the data structure being used; it is derived from early language implementations decades ago. (BTW, the call stack is usually allocated from the heap by the environment, with a fixed or user-specified size, at program or thread startup.) The nature of the use cases makes allocations and deallocations from a heap far more complicated than pushing and popping stack frames, and hardly possible to optimize directly in hardware.

Effects on Memory Access

The usual stack allocation always puts the new frame on the top, so it has quite good locality. This is friendly to the cache. OTOH, memory allocated randomly in the free store has no such property. Since ISO C++17, there are pool resource classes provided by <memory_resource>. The direct purpose of such an interface is to allow the results of consecutive allocations to be close together in memory. This acknowledges the fact that this strategy is generally good for performance with contemporary implementations, e.g. being friendly to the cache in modern architectures. This is about the performance of access rather than allocation, though.
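A minimal sketch of using one of those pool resources, assuming a C++17 standard library that ships `<memory_resource>`. The function name `sum_with_pool` is made up; `std::pmr::unsynchronized_pool_resource` and `std::pmr::list` are the standard names. The list nodes are requested from the pool instead of directly from the global allocation function.

```cpp
#include <list>
#include <memory_resource>

// Builds a list whose nodes come from a pool resource and sums its elements.
int sum_with_pool(int n) {
    // Single-threaded pool; it obtains larger chunks from the default
    // upstream resource and carves node-sized blocks out of them.
    std::pmr::unsynchronized_pool_resource pool;

    std::pmr::list<int> items(&pool);  // node allocations go through the pool
    for (int i = 0; i < n; ++i) items.push_back(i);

    int total = 0;
    for (int v : items) total += v;
    return total;
}  // `items` is destroyed before `pool`, which then releases its chunks
```

How close together the nodes actually end up is a property of the library's pool implementation, so any access speedup should be measured rather than assumed.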

Concurrency

The expectation of concurrent access to memory can have different effects on the stack and on heaps. A call stack is usually exclusively owned by one thread of execution in a typical C++ implementation. OTOH, heaps are often shared among the threads in a process. For such heaps, the allocation and deallocation functions have to protect the shared internal administrative data structures from data races. As a result, heap allocations and deallocations may have additional overhead due to internal synchronization operations.

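Space Efficiency

Due to the nature of the use cases and the internal data structures, heaps may suffer from internal memory fragmentation, while the stack does not. This has no direct impact on the performance of memory allocation, but on a system with virtual memory, low space efficiency may degrade the overall performance of memory access. This is particularly bad when an HDD is used as swap space for physical memory. It can cause very long latencies, sometimes billions of cycles.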

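Limitations of Stack Allocations

Although stack allocation usually outperforms heap allocation in practice, this certainly does not mean that stack allocation can always replace heap allocation.

First, ISO C++ offers no portable way to allocate space on the stack with a size specified at runtime. Implementations provide extensions such as alloca and G++'s VLA (variable-length array), but there are reasons to avoid them. (IIRC, the Linux source recently removed its uses of VLA.) (Also note that ISO C99 does mandate VLA, but ISO C11 makes support optional.)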
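A common portable substitute is a buffer with a fixed inline capacity that falls back to the free store when the runtime size is too large. The `ScratchInts` template below is a hypothetical sketch of that idea and is not a standard facility; it assumes `InlineCount` is greater than zero.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical scratch buffer: small runtime sizes use storage embedded in
// the object (automatic storage when the object is a local variable),
// large sizes fall back to the free store.
template <std::size_t InlineCount>
class ScratchInts {
public:
    explicit ScratchInts(std::size_t count)
        : heap_(count > InlineCount ? new int[count]() : nullptr),
          data_(heap_ ? heap_.get() : inline_),
          size_(count) {}

    // data_ may point into this object, so copying is disabled.
    ScratchInts(const ScratchInts&) = delete;
    ScratchInts& operator=(const ScratchInts&) = delete;

    int* data() { return data_; }
    std::size_t size() const { return size_; }
    bool on_heap() const { return heap_ != nullptr; }

private:
    int inline_[InlineCount] = {};   // zero-initialized inline storage
    std::unique_ptr<int[]> heap_;    // non-null only for large requests
    int* data_;
    std::size_t size_;
};
```

Declaring `ScratchInts<64> buf(n);` inside a function then gives stack-like cost for `n <= 64` without relying on alloca or VLA.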

Second, there is no reliable and portable way to detect stack space exhaustion. This is often called stack overflow (hmm, the etymology of this site), but probably more accurately, stack overrun. In reality, this often causes invalid memory access, and the state of the program is then corrupted (... or maybe worse, a security hole). In fact, ISO C++ has no concept of "the stack" and makes it undefined behavior when the resource is exhausted. Be cautious about how much room should be left for automatic objects.

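If the stack space runs out, too many objects have been allocated on the stack, which can be caused by too many active function calls or by improper use of automatic objects. Such cases may indicate a bug, e.g. a recursive function call without a correct exit condition.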

Nevertheless, deep recursive calls are sometimes desired. In implementations of languages that require support for unbounded active calls (where the call depth is limited only by total memory), it is impossible to use the (contemporary) native call stack directly as the target language's activation record, as typical C++ implementations do. To work around the problem, alternative ways of constructing activation records are needed. For example, SML/NJ explicitly allocates frames on the heap and uses cactus stacks. The complicated allocation of such activation record frames is usually not as fast as that of call stack frames. However, if such languages are further implemented with a guarantee of proper tail recursion, direct stack allocation in the object language (that is, an "object" in the language is not stored as a reference, but as a native primitive value that can be mapped one-to-one to an unshared C++ object) is even more complicated and in general carries a larger performance penalty. When using C++ to implement such languages, it is difficult to estimate the performance impact.
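In ordinary C++ code, the analogous workaround is to replace deep recursion with an explicit stack kept on the heap. The sketch below computes the depth of a tree this way; the adjacency-list representation and the `max_depth` function are assumptions made up for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical tree stored as an adjacency list: children[i] lists the
// child indices of node i, and node 0 is the root.
std::size_t max_depth(const std::vector<std::vector<std::size_t>>& children) {
    if (children.empty()) return 0;

    // Each Frame plays the role of one activation record of the
    // recursive version, but it lives in heap-backed vector storage.
    struct Frame {
        std::size_t node;
        std::size_t depth;
    };
    std::vector<Frame> stack;
    stack.push_back({0, 1});

    std::size_t deepest = 0;
    while (!stack.empty()) {
        Frame f = stack.back();
        stack.pop_back();
        if (f.depth > deepest) deepest = f.depth;
        for (std::size_t child : children[f.node])
            stack.push_back({child, f.depth + 1});
    }
    return deepest;
}
```

The traversal depth is then bounded by available memory rather than by the size of the native call stack.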

2018-11-03 15:21:56

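Stack allocation will almost always be as fast as or faster than heap allocation, although it is certainly possible for a heap allocator to simply use a stack-based allocation technique.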

However, there are larger issues when dealing with the overall performance of stack vs. heap based allocation (or, in slightly better terms, local vs. external allocation). Usually, heap (external) allocation is slow because it has to deal with many different kinds of allocations and allocation patterns. Reducing the scope of the allocator you are using (making it local to the algorithm or code) will tend to increase performance without any major changes. Adding better structure to your allocation patterns, for example by forcing a LIFO ordering on allocation and deallocation pairs, can also improve your allocator's performance, because the allocator is then used in a simpler and more structured way. Or, you can use or write an allocator tuned for your particular allocation pattern; most programs allocate a few discrete sizes frequently, so a heap that is based on a lookaside buffer of a few fixed (preferably known) sizes will perform extremely well. Windows uses its low-fragmentation heap for this very reason.
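A minimal sketch of an allocator tuned for one fixed block size: a single slab carved into equal blocks that are recycled through a free list. The `FixedPool` class is hypothetical and is not how the Windows low-fragmentation heap is implemented; it only illustrates why fixed-size allocation can be O(1).

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-size block pool: one slab from the heap, carved into
// equal blocks that are recycled through an intrusive free list.
class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)),
          slab_(block_size_ * block_count) {
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < block_count; ++i)
            deallocate(slab_.data() + i * block_size_);
    }

    // The free list stores pointers into slab_, so copying is disabled.
    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // O(1): pop the head of the free list; nullptr when the pool is empty.
    void* allocate() {
        if (!free_) return nullptr;
        Node* n = free_;
        free_ = n->next;
        return n;
    }

    // O(1): push the block back onto the free list.
    void deallocate(void* p) { free_ = new (p) Node{free_}; }

private:
    struct Node {
        Node* next;
    };

    // Blocks must hold a Node and keep every block maximally aligned.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        if (n < sizeof(Node)) n = sizeof(Node);
        return (n + a - 1) / a * a;
    }

    std::size_t block_size_;
    std::vector<unsigned char> slab_;
    Node* free_ = nullptr;
};
```

Because every block has the same size, neither searching for a fitting hole nor coalescing is needed, which is the main source of the speedup over a general-purpose heap.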

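On the other hand, stack-based allocation is also fraught with peril on a 32-bit address range if you have too many threads. Stacks need a contiguous memory range, so the more threads you have, the more virtual address space you need for them to run without a stack overflow. This is not a problem (for now) with 64-bit programs, but it can certainly wreak havoc in long-running programs with lots of threads. Running out of virtual address space due to fragmentation is always a pain to deal with.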

2010-08-10 16:27:39

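Don't make premature assumptions, because other application code and usage can affect your function. Looking at a function in isolation is therefore of no use.

If you are serious about the application, then run VTune on it or use a similar profiling tool and look at the hotspots.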

糯米

2009-02-04 17:21:37
