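Suppose I have a 4-core CPU and I want to run some process in the minimum amount of time. The process is ideally parallelizable, so I can run chunks of it on an unlimited number of threads, and each thread takes the same amount of time.

Since I have 4 cores, I don't expect any speedup from running more threads than cores, because a single core can only run a single thread at a given moment. I don't know much about hardware, so this is only a guess.

Is there a benefit to running a parallelizable process on more threads than cores? In other words, will my process finish faster, slower, or in about the same amount of time if I run it using 4000 threads rather than 4 threads?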


Current answer

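The answer depends on the complexity of the algorithms used in the program. I came up with a method to calculate the optimal number of threads by making two measurements of the processing times Tn and Tm for two arbitrary numbers of threads "n" and "m". For linear algorithms, the optimal number of threads is N = √((mn(Tm*(n-1) - Tn*(m-1)))/(nTn-mTm)).

Please read my article on calculating the optimal number for various algorithms: pavelkazenin.wordpress.com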
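
As a rough illustration, the two-measurement formula can be evaluated in a few lines of Python. The sketch rests on an assumption that is not stated in the answer itself: that run time follows T(k) = A/k + B*(k - 1), where A is the parallelizable work and B is a per-thread overhead. Under that model the optimum is sqrt(A/B), which works out to the expression in n, m, Tn and Tm. The function name `optimal_threads` and the synthetic measurements are hypothetical.

```python
import math

def optimal_threads(n, t_n, m, t_m):
    """Estimate the optimal thread count from two timing measurements.

    Assumption (illustrative, not from the answer): run time follows
    T(k) = A/k + B*(k - 1), so the optimum is sqrt(A/B), which equals
    sqrt(m*n*(Tm*(n-1) - Tn*(m-1)) / (n*Tn - m*Tm)).
    """
    numer = m * n * (t_m * (n - 1) - t_n * (m - 1))
    denom = n * t_n - m * t_m
    if denom == 0 or numer / denom <= 0:
        raise ValueError("measurements do not fit the assumed model")
    return math.sqrt(numer / denom)

def model(k, a=100.0, b=1.0):
    """Synthetic timings generated from the assumed model (A=100, B=1)."""
    return a / k + b * (k - 1)

# Measure at n=2 and m=5 threads, then estimate the optimum.
print(optimal_threads(2, model(2), 5, model(5)))
```

With real measurements the estimate is only as good as the assumed model, so it is probably best treated as a starting point for further testing.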

Other answers

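Speaking from a compute-bound and memory-bound (scientific computing) point of view, 4000 threads will make the application run really slowly. Part of the problem is the very high overhead of context switching, and most likely very poor memory locality.

But it also depends on your architecture. I've heard that Niagara processors are supposed to be able to handle multiple threads on a single core using some kind of advanced pipelining technique. However, I have no experience with those processors.

If your threads don't do I/O, synchronization, etc., and nothing else is running, 1 thread per core will get you the best performance. However, that is very likely not the case. Adding more threads usually helps, but after some point they cause performance to degrade.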

Not long ago, I was doing performance testing on a machine with 2 quad-core processors, running an ASP.NET application on Mono under a pretty decent load. We played with the minimum and maximum number of threads, and in the end we found that for that particular application in that particular configuration the best throughput was somewhere between 36 and 40 threads. Anything outside those boundaries performed worse. Lesson learned? If I were you, I would test with different numbers of threads until you find the right number for your application.
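
A minimal sketch of that kind of test, in Python, might look like the following. The workload `fake_request` is a hypothetical stand-in that just sleeps to imitate an I/O-bound request; it is used here because CPU-bound threads in CPython do not run in parallel. The helper names are illustrative too. Swap in your real task to get meaningful numbers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(_):
    """Stand-in for an I/O-bound request (hypothetical workload)."""
    time.sleep(0.01)

def time_with_threads(task, n_tasks, n_threads):
    """Return the wall-clock seconds needed to run n_tasks on n_threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(task, range(n_tasks)))  # block until every task is done
    return time.perf_counter() - start

def sweep(task, n_tasks, thread_counts):
    """Measure each candidate thread count and return {count: seconds}."""
    return {n: time_with_threads(task, n_tasks, n) for n in thread_counts}

if __name__ == "__main__":
    results = sweep(fake_request, 64, [1, 2, 4, 8, 16, 32])
    for n, seconds in results.items():
        print(f"{n:>3} threads: {seconds:.3f} s")
    print("best thread count:", min(results, key=results.get))
```

Timings from a single run are noisy, so repeating each measurement a few times and comparing medians is probably wise.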

One thing is for sure: 4k threads will take longer. That's a lot of context switches.

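I'd like to add another perspective here. The answer depends on whether the question is assuming weak scaling or strong scaling.

From Wikipedia:

Weak scaling: how the solution time varies with the number of processors for a fixed problem size per processor.

Strong scaling: how the solution time varies with the number of processors for a fixed total problem size.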

If the question is assuming weak scaling, then @Gonzalo's answer suffices. However, if the question is assuming strong scaling, there is something more to add. In strong scaling you assume a fixed workload size, so if you increase the number of threads, the amount of data each thread needs to work on decreases. On modern CPUs memory accesses are expensive, and it is preferable to maintain locality by keeping the data in caches. Therefore, the optimal number of threads is likely to be found where the dataset of each thread fits in each core's cache (I'm not going into the details of whether it's the L1, L2, or L3 cache of the system).

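This holds true even when the number of threads exceeds the number of cores. For example, assume there are 8 arbitrary units (AU) of work in the program, to be executed on a 4-core machine.

Case 1: run with four threads, where each thread needs to complete 2 AU. Each thread takes 10s to complete (with a lot of cache misses). With four cores, the total time is 10s (10s * 4 threads / 4 cores).

Case 2: run with eight threads, where each thread needs to complete 1 AU. Each thread takes only 2s (instead of 5s, because of the reduced number of cache misses). With four cores, the total time is 4s (2s * 8 threads / 4 cores).

I've simplified the problem and ignored the overheads mentioned in other answers (e.g., context switches), but I hope you get the point that it might be beneficial to have more threads than the available number of cores, depending on the size of the data you're dealing with.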
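
The arithmetic in the two cases can be written out as a tiny Python sketch. It uses the same simplified model as the example (per-thread time * threads / cores, no context-switch overhead) and assumes the thread count is a multiple of the core count; the function name `total_time` is just illustrative.

```python
def total_time(seconds_per_thread, n_threads, n_cores):
    """Total wall time under the simplified model from the example:
    threads are spread evenly over the cores and switching is free."""
    return seconds_per_thread * n_threads / n_cores

case1 = total_time(10, 4, 4)  # 4 threads, 2 AU each, many cache misses
case2 = total_time(2, 8, 4)   # 8 threads, 1 AU each, fewer cache misses
print(case1, case2)
```

The 10s and 2s per-thread figures are the made-up numbers from the example; in practice they would have to be measured.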

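I know this question is rather old, but things have changed since 2009.

There are two things to take into account now: the number of cores, and the number of threads that can run within each core.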

With Intel processors, the number of threads per core is defined by Hyperthreading, which is just 2 (when available). But Hyperthreading splits a core's execution capacity in two, even when you are not using 2 threads! (That is, 1 pipeline is shared between two processes; this is good when you have more processes, not so good otherwise. More cores are definitely better!) Note that modern CPUs generally have more pipelines to divide the workload, so it's not really divided by two anymore. But Hyperthreading still shares a lot of the CPU units between the two threads (some call those logical CPUs).

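On other processors you may have 2, 4, or even 8 threads per core. So if you have 8 cores that each support 8 threads, you could have 64 processes running in parallel without context switching.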
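
As a small sketch of that count in Python: the helper `hardware_threads` is hypothetical, while `os.cpu_count()` is the standard-library call that reports how many logical CPUs the operating system sees (cores times threads per core on most systems).

```python
import os

def hardware_threads(cores, threads_per_core):
    """Number of threads the hardware can run truly simultaneously."""
    return cores * threads_per_core

print(hardware_threads(8, 8))  # the example above: 8 cores x 8 threads each
print(os.cpu_count())          # logical CPUs reported on this machine
```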

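"No context switching" is obviously not strictly true if you run a standard operating system, which will context switch for all sorts of other things that are out of your control. But that's the main idea. Some operating systems let you allocate processors so that only your application has access to and use of those processors!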
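
One piece of this that can be sketched in Python is pinning a process to specific CPUs. Note the limits: `os.sched_setaffinity` is only available on some Unix systems such as Linux, and it restricts where your process runs; it does not by itself keep other processes off those CPUs, which needs separate OS configuration. The wrapper `restrict_to_cpus` is a hypothetical helper.

```python
import os

def restrict_to_cpus(cpus):
    """Pin the current process to the given CPU ids, if the OS supports it.

    Returns the resulting affinity set, or None where the call is
    unavailable (os.sched_setaffinity exists only on some Unix systems).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        print("currently allowed CPUs:", os.sched_getaffinity(0))
    else:
        print("CPU affinity is not exposed by os on this platform")
```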

From my own experience, if you have a lot of I/O, multiple threads are good. If you have very memory-intensive work (read source 1, read source 2, fast computation, write), then having more threads doesn't help. Again, this depends on how much data you read and write simultaneously. For example, if you use SIMD instructions (SSE/AVX) and read 128-bit or 256-bit values, that stalls all threads in their tracks. In other words, 1 thread is probably a lot easier to implement and probably nearly as speedy, if not actually faster. This will depend on your processor and memory architecture: some advanced servers manage separate memory ranges for separate cores, so separate threads will be faster assuming your data is properly laid out. That is why, on some architectures, 4 processes will run faster than 1 process with 4 threads.