Performance optimization strategies of last resort

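There are plenty of performance questions on this site already, but it occurs to me that almost all of them are very problem-specific and fairly narrow. And almost all of them repeat the advice to avoid premature optimization.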

Let's assume:

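- the code is already working correctly
- the algorithms chosen are already optimal for the circumstances of the problem
- the code has been measured, and the offending routines have been isolated
- all attempts to optimize will also be measured, to ensure they do not make matters worse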

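What I am looking for here are strategies and tricks to squeeze out the last few percent in a critical algorithm, when there is nothing else left to do but whatever it takes.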

Ideally, try to make answers language-agnostic, and point out any downsides of the suggested strategies where applicable.

I'll add a reply with my own initial suggestions, and I look forward to whatever else the Stack Overflow community can think of.

Current answer

Although I like Mike Dunlavey's answer (it is a great answer, with a supporting example), I think it can be expressed very simply:

Find out what takes the most time first, and understand why.

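It is the process of identifying the time hogs that helps you understand where you must refine your algorithm. This is the only all-encompassing, language-agnostic answer I can find to a problem that is already supposed to be fully optimized. It also presumes that you want to stay architecture-independent in your quest for speed.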

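So while the algorithm may be optimized, its implementation may not be. The identification tells you which part is which: algorithm or implementation. Whichever hogs the most time is your prime candidate for review. But since you say you want to squeeze out the last few percent, you might also want to examine the lesser parts, the ones you did not look at that closely the first time.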

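Lastly, a bit of trial and error with performance figures for different ways of implementing the same solution, or possibly for different algorithms, can bring insights that help identify time wasters and time savers.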
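As a minimal sketch of that kind of trial and error, the C fragment below times two interchangeable implementations of the same task and returns their results so they can be checked against each other. Everything in it is an assumption made for illustration: the bit-counting task, the names `popcount_naive`, `popcount_clear_lowest` and `timed_total`, and the use of the standard `clock()` function, which has coarse resolution and measures CPU time rather than wall-clock time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Candidate 1 (hypothetical): shift through the bits one at a time. */
unsigned popcount_naive(uint32_t v)
{
    unsigned count = 0;
    while (v != 0) {
        count += v & 1u;
        v >>= 1;
    }
    return count;
}

/* Candidate 2 (hypothetical): clear the lowest set bit each iteration,
   so the loop body runs once per set bit. */
unsigned popcount_clear_lowest(uint32_t v)
{
    unsigned count = 0;
    while (v != 0) {
        v &= v - 1u;
        ++count;
    }
    return count;
}

typedef unsigned (*popcount_fn)(uint32_t);

/* Runs one candidate over the whole input and stores the CPU time used
   in *seconds. Returning the total keeps the work observable and lets
   the caller verify that all candidates agree. */
uint64_t timed_total(popcount_fn fn, const uint32_t *words, size_t n,
                     double *seconds)
{
    assert(fn != NULL && seconds != NULL);
    uint64_t total = 0;
    clock_t start = clock();
    for (size_t i = 0; i < n; ++i)
        total += fn(words[i]);
    clock_t stop = clock();
    *seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    return total;
}
```

In practice the input would need to be large enough, and the run repeated often enough, for the timings to rise above the clock's resolution. Comparing the returned totals first guards against "optimizing" into a wrong answer.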

HPH, asoudmove。

2011-01-26 04:35:20

Other answers

I've spent most of my life in just this place. The broad strokes are to run your profiler and get it to record:

- Cache misses. Data cache is the #1 source of stalls in most programs. Improve cache hit rate by reorganizing offending data structures to have better locality; pack structures and numerical types down to eliminate wasted bytes (and therefore wasted cache fetches); prefetch data wherever possible to reduce stalls.
- Load-hit-stores. Compiler assumptions about pointer aliasing, and cases where data is moved between disconnected register sets via memory, can cause a certain pathological behavior that causes the entire CPU pipeline to clear on a load op. Find places where floats, vectors, and ints are being cast to one another and eliminate them. Use __restrict liberally to promise the compiler about aliasing.
- Microcoded operations. Most processors have some operations that cannot be pipelined, but instead run a tiny subroutine stored in ROM. Examples on the PowerPC are integer multiply, divide, and shift-by-variable-amount. The problem is that the entire pipeline stops dead while this operation is executing. Try to eliminate use of these operations or at least break them down into their constituent pipelined ops so you can get the benefit of superscalar dispatch on whatever the rest of your program is doing.
- Branch mispredicts. These too empty the pipeline. Find cases where the CPU is spending a lot of time refilling the pipe after a branch, and use branch hinting if available to get it to predict correctly more often. Or better yet, replace branches with conditional-moves wherever possible, especially after floating point operations because their pipe is usually deeper and reading the condition flags after fcmp can cause a stall.
- Sequential floating-point ops. Make these SIMD.
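To make the "pack structures down" point concrete, here is a small C sketch that assumes a hypothetical particle record; the struct names, field names, and the helper `records_per_cache_line` are all invented for illustration. Reordering fields from largest to smallest alignment removes padding bytes, so more records fit in each cache line. Exact sizes depend on the ABI; the figures in the comments are what one would typically expect on common 64-bit targets, not a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields in "as they came to mind" order. Each 1-byte field that
   precedes a wider field forces the compiler to insert padding.
   Typically 32 bytes on common 64-bit ABIs. */
struct particle_loose {
    uint8_t alive;
    double  mass;
    uint8_t kind;
    int32_t id;
    uint8_t team;
};

/* Same information, widest fields first. Padding is left only at the
   tail. Typically 16 bytes on common 64-bit ABIs. */
struct particle_packed {
    double  mass;
    int32_t id;
    uint8_t alive;
    uint8_t kind;
    uint8_t team;
};

/* How many whole records of a given size fit in one cache line.
   The line size is a parameter because it varies by CPU
   (64 bytes is common but is an assumption here). */
size_t records_per_cache_line(size_t record_size, size_t line_size)
{
    assert(record_size > 0);
    return line_size / record_size;
}
```

Calling `records_per_cache_line(sizeof(struct particle_packed), 64)` and the same for `particle_loose` shows how many records each layout delivers per fetched line. The downside is that the field order no longer follows the logical grouping, which can make the declaration harder to read.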

And one more thing I like to do:

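Set your compiler to output assembly listings and look at what it emits for the hotspot functions in your code. All those clever optimizations that "a good compiler should be able to do for you automatically"? Chances are your actual compiler doesn't do them. I've seen GCC emit truly WTF code.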
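For instance, a small kernel like the following could be compiled to an assembly listing and inspected by hand. This is only a sketch: the function `scale_add` and the file name `kernel.c` are hypothetical, and `restrict` is the standard C99 spelling of the aliasing promise. With GCC, the `-S` option stops after compilation and writes the assembly to a file.

```c
#include <assert.h>
#include <stddef.h>

/* To see what the compiler actually emits for this function with GCC
   (file name is hypothetical):
       gcc -O2 -S kernel.c -o kernel.s
   Then read kernel.s and check, for example, whether the loop was
   vectorized and whether y[i] is reloaded more often than expected. */

/* y[i] += a * x[i] for i in [0, n). The restrict qualifiers promise
   the compiler that x and y do not overlap; calling this with
   overlapping arrays is undefined behavior. */
void scale_add(float *restrict y, const float *restrict x,
               float a, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Compiling the same function with and without the `restrict` qualifiers and comparing the two listings is one way to check whether the aliasing promise changes anything for a given compiler and optimization level.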

2009-05-29 22:19:44

Did you know that a CAT6 cable can shield against external interference 10 times better than a default Cat5e UTP cable?

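For any project that is not offline, even with the best software and the best hardware, if your throughput is weak then that thin line is going to squeeze your data and give you delays, albeit only milliseconds...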

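Also, the maximum throughput is higher on CAT6 cables, because there is a good chance that you will actually receive a cable with solid copper strands instead of CCA (copper-clad aluminium), which is often found in standard CAT5e cables.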

If you are facing lost or dropped packets, then more reliable throughput for 24/7 operation can make the difference you are looking for.

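For those who seek the ultimate in home or office connection reliability (and who are willing to say no to this year's fast food restaurants, so that by the end of the year you can get there), gift yourself the pinnacle of LAN connectivity in the form of a CAT7 cable from a reputable brand.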

2011-01-29 02:23:07

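Impossible to say. It depends on what the code looks like. If we can assume that the code already exists, then we can simply look at it and figure out from that how to optimize it.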

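Better cache locality, loop unrolling. Try to eliminate long dependency chains to get better instruction-level parallelism. Prefer conditional moves over branches when possible. Exploit SIMD instructions when possible.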
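As one illustration of shortening a dependency chain, the C sketch below sums an array twice: once with a single accumulator, where every addition must wait for the previous one, and once with four independent accumulators that the CPU can overlap. The function names are hypothetical and the choice of four lanes is an assumption; the best count depends on the hardware. Note the downside: floating-point addition is not associative, so the two versions can differ in the last bits for general inputs.

```c
#include <assert.h>
#include <stddef.h>

/* One long dependency chain: each add depends on the result of the
   previous add. */
double sum_chained(const double *x, size_t n)
{
    assert(n == 0 || x != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Four independent chains: the adds into s0..s3 do not depend on each
   other, so they can be in flight at the same time. This is also a
   manual 4x loop unroll. */
double sum_four_lanes(const double *x, size_t n)
{
    assert(n == 0 || x != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    /* Handle the leftover 0 to 3 elements. */
    for (; i < n; ++i)
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Whether this helps at all has to be measured: a compiler that is allowed to reassociate floating-point math may already do something similar, and one that is not allowed will leave the single-accumulator loop as written.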

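Understand what your code is doing, and understand the hardware it is running on. Then it becomes fairly simple to determine what you need to do to improve the performance of your code. That is really the only truly general piece of advice I can think of.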