在这个网站上已经有很多性能问题了,但是在我看来,几乎所有的问题都是非常具体的,而且相当狭窄。几乎所有人都重复了避免过早优化的建议。

我们假设:

代码已经正常工作了 所选择的算法对于问题的环境已经是最优的 对代码进行了测量,并隔离了有问题的例程 所有优化的尝试也将被衡量,以确保它们不会使事情变得更糟

我在这里寻找的是策略和技巧,在一个关键算法中,当没有其他事情可做,但无论如何都要挤出最后百分之几。

理想情况下,尽量让答案与语言无关,并在适用的情况下指出所建议的策略的任何缺点。

我将添加一个带有我自己最初建议的回复,并期待Stack Overflow社区能想到的任何其他东西。


当前回答

很难对这个问题给出一般的答案。这实际上取决于你的问题领域和技术实现。一种与语言无关的通用技术:识别无法消除的代码热点,并手工优化汇编代码。

其他回答

目前最重要的限制因素是有限的内存带宽。多核只会让情况变得更糟,因为带宽是在核之间共享的。此外,用于实现缓存的有限芯片区域也分配给了内核和线程,这进一步恶化了这个问题。最后,保持不同缓存一致性所需的芯片间信号也会随着核数的增加而增加。这也增加了一个惩罚。

这些是您需要管理的影响。有时是通过对代码的微观管理,但有时是通过仔细考虑和重构。

很多注释已经提到了缓存友好的代码。至少有两种不同的风格:

避免内存读取延迟。 降低内存总线压力(带宽)。

第一个问题与如何使数据访问模式更规则有关,从而使硬件预取器更有效地工作。避免动态内存分配,这会将数据对象分散在内存中。使用线性容器代替链表、散列和树。

第二个问题与提高数据重用有关。修改算法以处理适合可用缓存的数据子集,并在数据仍在缓存中时尽可能多地重用这些数据。

更紧密地封装数据并确保在热循环中使用缓存线路中的所有数据,将有助于避免这些其他影响,并允许在缓存中安装更多有用的数据。

如果更好的硬件是一个选择,那么一定要去做。否则

Check you are using the best compiler and linker options. If hotspot routine in different library to frequent caller, consider moving or cloning it to the callers module. Eliminates some of the call overhead and may improve cache hits (cf how AIX links strcpy() statically into separately linked shared objects). This could of course decrease cache hits also, which is why one measure. See if there is any possibility of using a specialized version of the hotspot routine. Downside is more than one version to maintain. Look at the assembler. If you think it could be better, consider why the compiler did not figure this out, and how you could help the compiler. Consider: are you really using the best algorithm? Is it the best algorithm for your input size?

我花了一些时间优化在低带宽和长延迟网络(例如卫星、远程、离岸)上运行的客户端/服务器业务系统,并能够通过相当可重复的过程实现一些显著的性能改进。

Measure: Start by understanding the network's underlying capacity and topology. Talking to the relevant networking people in the business, and make use of basic tools such as ping and traceroute to establish (at a minimum) the network latency from each client location, during typical operational periods. Next, take accurate time measurements of specific end user functions that display the problematic symptoms. Record all of these measurements, along with their locations, dates and times. Consider building end-user "network performance testing" functionality into your client application, allowing your power users to participate in the process of improvement; empowering them like this can have a huge psychological impact when you're dealing with users frustrated by a poorly performing system. Analyze: Using any and all logging methods available to establish exactly what data is being transmitted and received during the execution of the affected operations. Ideally, your application can capture data transmitted and received by both the client and the server. If these include timestamps as well, even better. If sufficient logging isn't available (e.g. closed system, or inability to deploy modifications into a production environment), use a network sniffer and make sure you really understand what's going on at the network level. Cache: Look for cases where static or infrequently changed data is being transmitted repetitively and consider an appropriate caching strategy. Typical examples include "pick list" values or other "reference entities", which can be surprisingly large in some business applications. In many cases, users can accept that they must restart or refresh the application to update infrequently updated data, especially if it can shave significant time from the display of commonly used user interface elements. Make sure you understand the real behaviour of the caching elements already deployed - many common caching methods (e.g. HTTP ETag) still require a network round-trip to ensure consistency, and where network latency is expensive, you may be able to avoid it altogether with a different caching approach. Parallelise: Look for sequential transactions that don't logically need to be issued strictly sequentially, and rework the system to issue them in parallel. I dealt with one case where an end-to-end request had an inherent network delay of ~2s, which was not a problem for a single transaction, but when 6 sequential 2s round trips were required before the user regained control of the client application, it became a huge source of frustration. Discovering that these transactions were in fact independent allowed them to be executed in parallel, reducing the end-user delay to very close to the cost of a single round trip. Combine: Where sequential requests must be executed sequentially, look for opportunities to combine them into a single more comprehensive request. Typical examples include creation of new entities, followed by requests to relate those entities to other existing entities. Compress: Look for opportunities to leverage compression of the payload, either by replacing a textual form with a binary one, or using actual compression technology. Many modern (i.e. within a decade) technology stacks support this almost transparently, so make sure it's configured. I have often been surprised by the significant impact of compression where it seemed clear that the problem was fundamentally latency rather than bandwidth, discovering after the fact that it allowed the transaction to fit within a single packet or otherwise avoid packet loss and therefore have an outsize impact on performance. Repeat: Go back to the beginning and re-measure your operations (at the same locations and times) with the improvements in place, record and report your results. As with all optimisation, some problems may have been solved exposing others that now dominate.

In the steps above, I focus on the application related optimisation process, but of course you must ensure the underlying network itself is configured in the most efficient manner to support your application too. Engage the networking specialists in the business and determine if they're able to apply capacity improvements, QoS, network compression, or other techniques to address the problem. Usually, they will not understand your application's needs, so it's important that you're equipped (after the Analyse step) to discuss it with them, and also to make the business case for any costs you're going to be asking them to incur. I've encountered cases where erroneous network configuration caused the applications data to be transmitted over a slow satellite link rather than an overland link, simply because it was using a TCP port that was not "well known" by the networking specialists; obviously rectifying a problem like this can have a dramatic impact on performance, with no software code or configuration changes necessary at all.

不好说。这取决于代码的样子。如果我们可以假设代码已经存在,那么我们可以简单地查看它并从中找出如何优化它。

更好的缓存位置,循环展开,尽量消除长依赖链,以获得更好的指令级并行性。尽可能选择有条件的移动而不是分支。尽可能利用SIMD指令。

理解你的代码在做什么,理解它运行在什么硬件上。然后,决定需要做什么来提高代码的性能就变得相当简单了。这是我能想到的唯一一个真正具有普遍性的建议。

好吧,还有“在SO上显示代码,并为特定的代码段寻求优化建议”。

调整操作系统和框架。

这听起来可能有点夸张,但可以这样想:操作系统和框架被设计用来做很多事情。您的应用程序只做非常具体的事情。如果你能让操作系统完全满足你的应用程序的需求,并让你的应用程序理解框架(php,.net,java)是如何工作的,你就能从硬件上得到更好的东西。

例如,Facebook改变了Linux中的一些内核级别的东西,改变了memcached的工作方式(例如,他们写了一个memcached代理,使用udp而不是tcp)。

另一个例子是Window2008。Win2K8有一个版本,你可以安装运行X应用程序所需的基本操作系统(例如web应用程序,服务器应用程序)。这大大减少了操作系统在运行进程方面的开销,并为您提供了更好的性能。

当然,你应该在第一步就投入更多的硬件……