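There are already plenty of performance questions on this site, but it seems to me that almost all of them are very problem-specific and fairly narrow. And almost all of them repeat the advice to avoid premature optimization.

Let's assume:

- the code is already working correctly
- the algorithms chosen are already optimal for the circumstances of the problem
- the code has been measured, and the offending routines have been isolated
- all attempts to optimize will also be measured, to ensure they do not make matters worse

What I am looking for here are strategies and tricks for squeezing out the last few percent in a critical algorithm, when there is nothing else left to do but whatever it takes.

Ideally, try to make answers language-agnostic, and point out any downsides of the suggested strategies where applicable.

I'll add a reply with my own initial suggestions, and look forward to whatever else the Stack Overflow community can think of.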


Current answer

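I've spent some time optimising client/server business systems operating over low-bandwidth, long-latency networks (e.g. satellite, remote, offshore), and have been able to achieve some dramatic performance improvements with a fairly repeatable process.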

- Measure: Start by understanding the network's underlying capacity and topology. Talk to the relevant networking people in the business, and make use of basic tools such as ping and traceroute to establish (at a minimum) the network latency from each client location, during typical operational periods. Next, take accurate time measurements of specific end user functions that display the problematic symptoms. Record all of these measurements, along with their locations, dates and times. Consider building end-user "network performance testing" functionality into your client application, allowing your power users to participate in the process of improvement; empowering them like this can have a huge psychological impact when you're dealing with users frustrated by a poorly performing system.
- Analyse: Use any and all logging methods available to establish exactly what data is being transmitted and received during the execution of the affected operations. Ideally, your application can capture data transmitted and received by both the client and the server. If these include timestamps as well, even better. If sufficient logging isn't available (e.g. closed system, or inability to deploy modifications into a production environment), use a network sniffer and make sure you really understand what's going on at the network level.
- Cache: Look for cases where static or infrequently changed data is being transmitted repetitively and consider an appropriate caching strategy. Typical examples include "pick list" values or other "reference entities", which can be surprisingly large in some business applications. In many cases, users can accept that they must restart or refresh the application to update infrequently updated data, especially if it can shave significant time from the display of commonly used user interface elements. Make sure you understand the real behaviour of the caching elements already deployed: many common caching methods (e.g. HTTP ETag) still require a network round-trip to ensure consistency, and where network latency is expensive, you may be able to avoid it altogether with a different caching approach.
- Parallelise: Look for sequential transactions that don't logically need to be issued strictly sequentially, and rework the system to issue them in parallel. I dealt with one case where an end-to-end request had an inherent network delay of ~2s, which was not a problem for a single transaction, but when 6 sequential 2s round trips were required before the user regained control of the client application, it became a huge source of frustration. Discovering that these transactions were in fact independent allowed them to be executed in parallel, reducing the end-user delay to very close to the cost of a single round trip.
- Combine: Where sequential requests must be executed sequentially, look for opportunities to combine them into a single more comprehensive request. Typical examples include creation of new entities, followed by requests to relate those entities to other existing entities.
- Compress: Look for opportunities to leverage compression of the payload, either by replacing a textual form with a binary one, or using actual compression technology. Many modern (i.e. within a decade) technology stacks support this almost transparently, so make sure it's configured. I have often been surprised by the significant impact of compression where it seemed clear that the problem was fundamentally latency rather than bandwidth, discovering after the fact that it allowed the transaction to fit within a single packet or otherwise avoid packet loss, and therefore had an outsize impact on performance.
- Repeat: Go back to the beginning and re-measure your operations (at the same locations and times) with the improvements in place, then record and report your results. As with all optimisation, some problems may have been solved, exposing others that now dominate.
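
The arithmetic behind the Parallelise step can be sketched as a tiny latency model. This is only an illustration: the function names are made up here, and the model assumes the requests are truly independent and that the link has enough bandwidth for them not to slow each other down.

```c
#include <stddef.h>

/* Total wait when each request starts only after the previous one returns. */
double sequential_delay(const double *round_trips, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += round_trips[i];
    return total;
}

/* Total wait when all requests are issued at once: the slowest one dominates.
   Assumption: the requests do not compete for bandwidth. */
double parallel_delay(const double *round_trips, size_t n)
{
    double longest = 0.0;
    for (size_t i = 0; i < n; i++)
        if (round_trips[i] > longest)
            longest = round_trips[i];
    return longest;
}
```

For six independent 2 s round trips, the sequential model gives 12 s and the parallel model gives 2 s, which matches the "very close to the cost of a single round trip" observation above.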

In the steps above, I focus on the application-related optimisation process, but of course you must ensure the underlying network itself is configured in the most efficient manner to support your application too. Engage the networking specialists in the business and determine if they're able to apply capacity improvements, QoS, network compression, or other techniques to address the problem. Usually, they will not understand your application's needs, so it's important that you're equipped (after the Analyse step) to discuss it with them, and also to make the business case for any costs you're going to be asking them to incur. I've encountered cases where erroneous network configuration caused the application's data to be transmitted over a slow satellite link rather than an overland link, simply because it was using a TCP port that was not "well known" by the networking specialists; obviously rectifying a problem like this can have a dramatic impact on performance, with no software code or configuration changes necessary at all.

Other answers

More suggestions:

- Avoid I/O: Any I/O (disk, network, ports, etc.) is always going to be far slower than any code that is performing calculations, so get rid of any I/O that you do not strictly need.
- Move I/O up-front: Load up all the data you are going to need for a calculation up-front, so that you do not have repeated I/O waits within the core of a critical algorithm (and maybe as a result repeated disk seeks, when loading all the data in one hit may avoid seeking).
- Delay I/O: Do not write out your results until the calculation is over, store them in a data structure and then dump that out in one go at the end when the hard work is done.
- Threaded I/O: For those daring enough, combine 'I/O up-front' or 'Delay I/O' with the actual calculation by moving the loading into a parallel thread, so that while you are loading more data you can work on a calculation on the data you already have, or while you calculate the next batch of data you can simultaneously write out the results from the last batch.
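
A minimal sketch of the 'Delay I/O' idea in C, assuming the results fit in memory. The function name and the squaring "work" are placeholders, not part of the answer above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Compute n results into memory first, then write them with a single call.
   Returns the number of results written, or 0 on failure. */
size_t compute_then_write(FILE *out, size_t n)
{
    double *results = malloc(n * sizeof *results);
    if (results == NULL)
        return 0;

    /* Hot loop: pure computation, no I/O inside it. */
    for (size_t i = 0; i < n; i++)
        results[i] = (double)i * (double)i;   /* stand-in for the real work */

    /* One bulk write once the hard work is done. */
    size_t written = fwrite(results, sizeof *results, n, out);
    free(results);
    return written;
}
```

The downside is memory: everything is held until the end, so for very large outputs a chunked variant (write every few thousand results) may be the better trade-off.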

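Hard to say. It depends on what the code looks like. If we can assume that the code already exists, then we can simply look at it and work out from that how to optimize it.

Better cache locality, loop unrolling, and trying to eliminate long dependency chains to get better instruction-level parallelism. Prefer conditional moves over branches when possible. Exploit SIMD instructions when possible.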
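
Two of these ideas can be sketched in portable C. The function names are invented for illustration, and whether either version is actually faster depends on the compiler and CPU (an optimizing compiler may already do both transformations), so measure before keeping them.

```c
#include <stddef.h>
#include <stdint.h>

/* Four independent accumulators shorten the dependency chain: each add
   depends only on its own accumulator, so the CPU can overlap them. */
uint64_t sum_four_accumulators(const uint32_t *a, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

/* Branch-free counting: the comparison yields 0 or 1, which is added
   directly instead of guarding an increment with an if. */
size_t count_above(const uint32_t *a, size_t n, uint32_t threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (a[i] > threshold);
    return count;
}
```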

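Understand what your code is doing, and understand the hardware it's running on. Then it becomes fairly simple to determine what you need to do to improve the performance of your code. That's really the only truly general piece of advice I can think of.

Well, that and "Show the code on SO and ask for optimization advice for that specific piece of code".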

OK, you're defining the problem to where it would seem there is not much room for improvement. That is fairly rare, in my experience. I tried to explain this in a Dr. Dobbs article in November 1993, by starting from a conventionally well-designed non-trivial program with no obvious waste and taking it through a series of optimizations until its wall-clock time was reduced from 48 seconds to 1.1 seconds, and the source code size was reduced by a factor of 4. My diagnostic tool was this. The sequence of changes was this:

- The first problem found was use of list clusters (now called "iterators" and "container classes") accounting for over half the time. Those were replaced with fairly simple code, bringing the time down to 20 seconds.
- Now the largest time-taker is more list-building. As a percentage, it was not so big before, but now it is because the bigger problem was removed. I find a way to speed it up, and the time drops to 17 seconds.
- Now it is harder to find obvious culprits, but there are a few smaller ones that I can do something about, and the time drops to 13 sec.

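Now I seem to have hit a wall. The samples are telling me exactly what it is doing, but I can't seem to find anything that I can improve. Then I reflect on the basic design of the program, on its transaction-driven structure, and ask if all the list-searching that it is doing is actually mandated by the requirements of the problem.

Then I hit upon a re-design, where the program code is actually generated (via preprocessor macros) from a smaller set of source, and in which the program is not constantly figuring out things that the programmer knows are fairly predictable. In other words, don't "interpret" the sequence of things to do, "compile" it.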
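
To make the "interpret" versus "compile" distinction concrete, here is a small sketch using an X-macro. It is not the article's code: the TASK_SEQUENCE macro, the operations, and the function names are all hypothetical, chosen only to show one sequence of steps being either walked as a table at run time or expanded into straight-line code by the preprocessor.

```c
#include <stddef.h>

/* Hypothetical sequence of steps, known when the program is built. */
#define TASK_SEQUENCE(X) \
    X(add, 5)            \
    X(mul, 3)            \
    X(add, 2)

enum op_kind { OP_add, OP_mul };
struct op { enum op_kind kind; int arg; };

/* "Interpreted": walk a table at run time and dispatch on each entry. */
int run_interpreted(int x)
{
#define AS_ENTRY(name, arg) { OP_##name, arg },
    static const struct op program[] = { TASK_SEQUENCE(AS_ENTRY) };
#undef AS_ENTRY
    for (size_t i = 0; i < sizeof program / sizeof program[0]; i++) {
        switch (program[i].kind) {
        case OP_add: x += program[i].arg; break;
        case OP_mul: x *= program[i].arg; break;
        }
    }
    return x;
}

static int do_add(int x, int arg) { return x + arg; }
static int do_mul(int x, int arg) { return x * arg; }

/* "Compiled": the same sequence expanded into straight-line calls,
   with no table, loop, or dispatch left at run time. */
int run_compiled(int x)
{
#define AS_CALL(name, arg) x = do_##name(x, arg);
    TASK_SEQUENCE(AS_CALL)
#undef AS_CALL
    return x;
}
```

Both functions compute the same result from the same single definition of the sequence; the second simply moves the "what comes next" decision from run time to build time.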

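That redesign is done, shrinking the source code by a factor of 4, and the time is reduced to 10 seconds.

Now, because it's getting so quick, it's hard to sample, so I give it 10 times as much work to do, but the following times are based on the original workload.

- More diagnosis reveals that it is spending time in queue management. In-lining these routines reduces the time to 7 seconds.
- Now a big time-taker is the diagnostic printing I had been doing. Removing it brings the time to 4 seconds.
- Now the biggest time-takers are calls to malloc and free. Recycling objects brings the time to 2.6 seconds.
- Continuing to sample, I still find operations that are not strictly necessary. Removing them brings the time to 1.1 seconds.

Total speedup factor: 43.6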
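
The object-recycling step can be sketched with a simple free list. This is an assumption about the general technique, not the article's actual code: the struct layout and function names are hypothetical, and the sketch is not thread-safe.

```c
#include <stdlib.h>

/* Hypothetical task record; the first field doubles as the free-list link. */
struct task {
    struct task *next_free;
    int id;
};

static struct task *free_list = NULL;   /* recycled, currently unused tasks */

/* Take a task from the free list if possible; call malloc only when empty. */
struct task *task_acquire(void)
{
    struct task *t = free_list;
    if (t != NULL)
        free_list = t->next_free;
    else
        t = malloc(sizeof *t);
    return t;   /* NULL only if malloc failed */
}

/* Instead of calling free(), push the task back for reuse. */
void task_release(struct task *t)
{
    t->next_free = free_list;
    free_list = t;
}
```

The downside is that recycled memory is never returned to the allocator, so peak usage is held for the lifetime of the program unless the free list is trimmed.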

Now no two programs are alike, but in non-toy software I've always seen a progression like this. First you get the easy stuff, and then the more difficult, until you get to a point of diminishing returns. Then the insight you gain may well lead to a redesign, starting a new round of speedups, until you again hit diminishing returns. Now this is the point at which it might make sense to wonder whether ++i or i++ or for(;;) or while(1) are faster: the kinds of questions I see so often on Stack Overflow.

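P.S. It may be wondered why I didn't use a profiler. The answer is that almost every one of these "problems" was a function call site, which stack samples pinpoint. Profilers, even today, are only barely coming around to the idea that statements and call instructions are more important to locate, and easier to fix, than whole functions.

I actually built a profiler to do this, but for a real down-and-dirty intimacy with what the code is doing, there's no substitute for getting your fingers right in it. It is not an issue that the number of samples is small, because none of the problems being found are so tiny that they are easily missed.

ADDED: jerryjvl requested some examples. Here is the first problem. It consists of a small number of separate lines of code, together taking over half the time: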

/* IF ALL TASKS DONE, SEND ITC_ACKOP, AND DELETE OP */
if (ptop->current_task >= ILST_LENGTH(ptop->tasklist)){
. . .
/* FOR EACH OPERATION REQUEST */
for ( ptop = ILST_FIRST(oplist); ptop != NULL; ptop = ILST_NEXT(oplist, ptop)){
. . .
/* GET CURRENT TASK */
ptask = ILST_NTH(ptop->tasklist, ptop->current_task);

These were using the list cluster ILST (similar to a list class). They are implemented in the usual way, with "information hiding" meaning that the users of the class were not supposed to have to care how they were implemented. When these lines were written (out of roughly 800 lines of code) thought was not given to the idea that these could be a "bottleneck" (I hate that word). They are simply the recommended way to do things. It is easy to say in hindsight that these should have been avoided, but in my experience all performance problems are like that. In general, it is good to try to avoid creating performance problems. It is even better to find and fix the ones that are created, even though they "should have been avoided" (in hindsight). I hope that gives a bit of the flavor.

Here is the second problem, in two separate lines:

/* ADD TASK TO TASK LIST */
ILST_APPEND(ptop->tasklist, ptask);
. . .
/* ADD TRANSACTION TO TRANSACTION QUEUE */
ILST_APPEND(trnque, ptrn);

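These are building lists by appending items to their ends. (The fix was to collect the items in arrays, and build the lists all at once.) The interesting thing is that these statements only cost (i.e. were on the call stack for) 3/48 of the original time, so they were not in fact a big problem at the beginning. However, after removing the first problem, they cost 3/20 of the time and so were now a "bigger fish". In general, that's how it goes.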
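
The "collect in an array, then build the list all at once" fix might look roughly like the sketch below. The ILST internals are not shown in this answer, so this uses a hypothetical singly linked `struct node` rather than the real list cluster.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical list node; stands in for whatever ILST uses internally. */
struct node {
    int value;
    struct node *next;
};

/* Build the whole list from an array in one pass, with one allocation,
   instead of one append call (and one allocation) per item. */
struct node *build_from_array(const int *values, size_t count)
{
    if (count == 0)
        return NULL;
    struct node *nodes = malloc(count * sizeof *nodes);
    if (nodes == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++) {
        nodes[i].value = values[i];
        nodes[i].next = (i + 1 < count) ? &nodes[i + 1] : NULL;
    }
    return nodes;   /* free(head) releases every node at once */
}
```

Because all nodes share one allocation, individual nodes cannot be freed separately, which is the price paid for the cheaper construction.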

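I might add that this project was distilled from a real project I helped on. In that project, the performance problems were far more dramatic (as were the speedups), such as calling a database-access routine within an inner loop to see if a task was finished.

REFERENCE ADDED: The source code, both original and redesigned, can be found at www.ddj.com, for 1993, in file 9311.zip, files slug.asc and slug.zip.

EDIT 2011/11/26: There is now a SourceForge project containing the source code in Visual C++, along with a blow-by-blow description of how it was tuned. It only goes through the first half of the scenario described above, and it doesn't follow exactly the same sequence, but it still gets a speedup of 2-3 orders of magnitude.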

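Here are some quick and dirty optimization techniques I use. I consider this to be a "first pass" optimization.

Learn where the time is spent: Is it file IO? Is it CPU time? Is it the network? Is it the database? It's useless to optimize IO if IO is not the bottleneck.

Know your environment: Knowing where to optimize typically depends on the development environment. In VB6, for example, passing by reference is slower than passing by value, but in C and C++, passing by reference is much faster. In C, it is reasonable to try something and do something different if a return code indicates failure, while in .NET, catching exceptions is much slower than checking for a valid condition before attempting.

Build indexes on frequently queried database fields. You can almost always trade space for speed.

Inside the loop to be optimized, I avoid having to do any lookups. Find the offset and/or index outside of the loop and reuse the data inside.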
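
A small sketch of moving a lookup out of the loop, using a row offset in a flat matrix as the "lookup". The function names are illustrative. For arithmetic this simple an optimizing compiler will often do the hoisting itself; the technique matters more when the lookup is a function call or hash probe the compiler cannot prove is loop-invariant.

```c
#include <stddef.h>

/* Lookup repeated inside the loop: the row offset is recomputed per element. */
long sum_row_naive(const int *matrix, size_t cols, size_t row)
{
    long sum = 0;
    for (size_t c = 0; c < cols; c++)
        sum += matrix[row * cols + c];
    return sum;
}

/* Offset found once outside the loop and reused inside it. */
long sum_row_hoisted(const int *matrix, size_t cols, size_t row)
{
    const int *row_start = matrix + row * cols;   /* computed once */
    long sum = 0;
    for (size_t c = 0; c < cols; c++)
        sum += row_start[c];
    return sum;
}
```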

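Minimize IO: Try to design in a manner that reduces the number of times you have to read or write, especially over a networked connection.

Reduce abstractions: The more layers of abstraction the code has to work through, the slower it is. Inside the critical loop, reduce abstractions (e.g. reveal lower-level methods that avoid extra code).

For projects with a user interface, spawning a new thread to perform slower tasks makes the application feel more responsive, although it isn't actually faster.

You can generally trade space for speed. If there are calculations or other intense operations, see if you can precompute some of the information before you're in the critical loop.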
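
As one illustration of precomputing before the critical loop, the sketch below builds a 256-entry table of bit counts once and then uses it in the hot loop. The bit-counting task and the names are assumptions made for the example; whether a table beats direct computation depends on the platform, so measure.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t bits_in_byte[256];   /* 256 bytes of space traded for speed */

/* Run once, before the critical loop. */
void init_bit_table(void)
{
    for (int v = 0; v < 256; v++) {
        uint8_t count = 0;
        for (int b = v; b != 0; b >>= 1)
            count += (uint8_t)(b & 1);
        bits_in_byte[v] = count;
    }
}

/* Critical loop: one table lookup per byte instead of a bit-by-bit count. */
size_t count_set_bits(const uint8_t *data, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += bits_in_byte[data[i]];
    return total;
}
```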

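- What hardware are you running on? Can you use platform-specific optimizations (like vectorization)?
- Can you get a better compiler? E.g. switch from GCC to Intel?
- Can you make your algorithm run in parallel?
- Can you reduce cache misses by reorganizing data?
- Can you disable asserts?
- Micro-optimize for your compiler and platform, in the style of "at an if/else, put the most common case first".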
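
One common way to reorganize data for fewer cache misses is to switch from an array of structures to a structure of arrays when a loop touches only one field. The sketch below assumes a hypothetical particle record and a fixed capacity; it shows the layout change only, and the actual gain has to be measured.

```c
#include <stddef.h>

#define MAX_PARTICLES 1024   /* hypothetical capacity */

/* Array of structures: summing x also drags y, z and mass through the cache. */
struct particle { double x, y, z, mass; };

double sum_x_aos(const struct particle *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += p[i].x;
    return sum;
}

/* Structure of arrays: all x values are contiguous, so every cache line
   fetched by the loop is filled with data it actually uses. */
struct particles {
    double x[MAX_PARTICLES];
    double y[MAX_PARTICLES];
    double z[MAX_PARTICLES];
    double mass[MAX_PARTICLES];
};

double sum_x_soa(const struct particles *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += p->x[i];
    return sum;
}
```

The trade-off is that code which needs all fields of one particle at once now touches four separate arrays, so the better layout depends on the dominant access pattern.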