Optimal number of threads per core

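Say I have a 4-core CPU, and I want to run some process in the minimum amount of time. The process is ideally parallelizable, so I can run chunks of it on any number of threads, and each thread takes the same amount of time.

Since I have 4 cores, I don't expect any speedup from running more threads than cores, because a single core can only run a single thread at a given moment. I don't know much about hardware, so this is only a guess.

Is there a benefit to running a parallelizable process on more threads than cores? In other words, will my process finish faster, slower, or in about the same amount of time if I run it with 4000 threads rather than 4?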

Current answer

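4000 threads at one time is pretty high.

The answer is yes and no. If you are doing a lot of blocking I/O in each thread, then yes, you could see significant speedups running around 3 or 4 threads per logical core.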

If you are not doing much blocking work, however, the extra threading overhead will just make it slower. So use a profiler and see where the bottlenecks are in each potentially parallel piece. If you are doing heavy computation, more than one thread per CPU won't help. If you are doing a lot of memory transfer, it won't help either. If you are doing a lot of I/O, though, such as disk or network access, then yes, multiple threads will help up to a point, or at the very least make the application more responsive.
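To make the blocking-I/O case concrete, here is a minimal Python sketch. It is an illustration under assumptions, not a benchmark: time.sleep stands in for a blocking call, and the names fake_blocking_io and run_with_threads, the 16 tasks, and the 0.05s delay are all made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_blocking_io(delay=0.05):
    """Hypothetical stand-in for a blocking call such as a disk read or HTTP request."""
    time.sleep(delay)  # the thread is parked here and uses no CPU
    return delay

def run_with_threads(num_threads, num_tasks=16):
    """Run num_tasks blocking tasks on num_threads threads; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Consume the iterator so every task has finished before timing stops.
        list(pool.map(lambda _: fake_blocking_io(), range(num_tasks)))
    return time.perf_counter() - start

if __name__ == "__main__":
    for n in (4, 16):
        print(f"{n:2d} threads: {run_with_threads(n):.2f}s")
```

Because the tasks spend their time waiting rather than computing, the run with more threads should finish sooner regardless of the core count. Replacing the sleep with real computation would remove that advantage.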

2009-11-11 22:32:32

Other answers

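You can get an idea of how many threads your machine can run by running htop or the ps command, which reports the number of processes on your machine.

See the man page for the 'ps' command: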

man ps

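To count the processes of all users, you can use one of these commands (the second prints one line per thread, and both counts include the header line):

ps aux | wc -l
ps -eLf | wc -l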

To count the processes of a single user:

ps --User root | wc -l

You can also use "htop" [Reference]:

Install on Ubuntu or Debian:

sudo apt-get install htop

Install on Red Hat or CentOS:

yum install htop
dnf install htop      [On Fedora 22+ releases]

If you want to compile htop from source, you can find it here.
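As a complement to the commands above, a program can also ask for the core count itself. The following is a small Python sketch using only the standard library; the helper name logical_core_count is an assumption made for this example.

```python
import os
import threading

def logical_core_count():
    """Number of logical cores reported by the OS (falls back to 1 if unknown)."""
    return os.cpu_count() or 1

if __name__ == "__main__":
    print("logical cores:", logical_core_count())
    # Threads alive in this Python process only, not on the whole machine.
    print("threads in this process:", threading.active_count())
```

The core count is usually the more useful starting point for sizing a thread pool. The process count from ps describes what is currently running, not what the hardware can run in parallel.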

2017-10-23 08:31:34

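I'd like to add another perspective here. The answer depends on whether the question assumes weak scaling or strong scaling.

From Wikipedia:

Weak scaling: how the solution time varies with the number of processors for a fixed problem size per processor.

Strong scaling: how the solution time varies with the number of processors for a fixed total problem size.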

If the question assumes weak scaling, then @Gonzalo's answer suffices. If it assumes strong scaling, however, there is something more to add. In strong scaling you assume a fixed workload size, so as you increase the number of threads, the amount of data each thread needs to work on decreases. On modern CPUs memory accesses are expensive, so it is preferable to maintain locality by keeping the data in caches. Therefore, the optimal number of threads is likely the one at which each thread's dataset fits in each core's cache (I'm not going into the details of whether that means the system's L1, L2, or L3 cache).

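This holds even when the number of threads exceeds the number of cores. For example, assume a program has 8 arbitrary units (AU) of work to be executed on a 4-core machine.

Case 1: run four threads, each of which needs to complete 2 AU. Each thread takes 10s to complete (with a lot of cache misses). With four cores, the total time is 10s (10s * 4 threads / 4 cores).

Case 2: run eight threads, each of which needs to complete 1 AU. Each thread takes only 2s (instead of 5s, because of the reduced number of cache misses). With four cores, the total time is 4s (2s * 8 threads / 4 cores).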
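The arithmetic of the two cases can be checked with a few lines of Python. This only reproduces the answer's simplified formula (seconds per thread * threads / cores); the function name total_time is an assumption made for the example.

```python
def total_time(seconds_per_thread, num_threads, num_cores):
    """Total time under the simplified model above, ignoring all overhead."""
    return seconds_per_thread * num_threads / num_cores

# Case 1: four threads at 10s each; Case 2: eight threads at 2s each.
case1 = total_time(seconds_per_thread=10, num_threads=4, num_cores=4)
case2 = total_time(seconds_per_thread=2, num_threads=8, num_cores=4)
print(case1, case2)  # 10.0 4.0
```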

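I've simplified the problem and ignored the overheads mentioned in other answers (e.g., context switches), but I hope you get the point: depending on the size of the data you are dealing with, having more threads than available cores can be beneficial.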
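One rough way to turn the cache-fit idea into a number is sketched below in Python. It is a heuristic under assumptions, not a rule from the answer: the function name, the per-core cache size, and the dataset size are all hypothetical, and the sketch ignores scheduling overhead and caches shared between cores.

```python
import math

def threads_for_cache_fit(dataset_bytes, cache_bytes_per_core, num_cores):
    """Smallest thread count, at least num_cores, whose per-thread chunk fits in cache."""
    needed = math.ceil(dataset_bytes / cache_bytes_per_core)
    return max(num_cores, needed)

# Hypothetical numbers: 2 MiB of data, 256 KiB of cache per core, 4 cores.
print(threads_for_cache_fit(2 * 1024**2, 256 * 1024, 4))  # 8
```

With these made-up sizes the estimate is eight threads on four cores, the same shape as Case 2. In practice the result would still need to be confirmed by measurement.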

2017-03-17 01:38:44

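I agree with @Gonzalo's answer. I have a process that does no I/O, and here is what I found:

Note that all threads work on one array but on different ranges (no two threads access the same index), so the results may differ if they work on different arrays.

The 1.86 machine is a MacBook Air with an SSD. The other Mac is an iMac with a normal HDD (I think it is 7200 rpm). The Windows machine also has a 7200 rpm HDD.

In this test, the optimal number of threads was equal to the number of cores in the machine.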
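The setup described above, with several threads on one array and each thread confined to its own index range, might look like the following Python sketch. This is not the code behind the test: the helper names are invented, and in standard CPython the global interpreter lock keeps pure-Python computation from running on several cores at once, so the sketch shows only the partitioning, not the timings.

```python
import threading

def split_ranges(length, num_threads):
    """Split range(length) into num_threads contiguous, non-overlapping (start, stop) pairs."""
    base, extra = divmod(length, num_threads)
    ranges, start = [], 0
    for i in range(num_threads):
        stop = start + base + (1 if i < extra else 0)  # first `extra` ranges get one more item
        ranges.append((start, stop))
        start = stop
    return ranges

def parallel_square(data, num_threads):
    """Square every element in place; each thread touches only its own index range."""
    def work(start, stop):
        for i in range(start, stop):
            data[i] = data[i] * data[i]
    threads = [threading.Thread(target=work, args=r)
               for r in split_ranges(len(data), num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every range to be processed
    return data

if __name__ == "__main__":
    print(parallel_square(list(range(10)), 4))
```

Because no two threads write the same index, no locking is needed around the array.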

2012-05-20 02:55:45
