假设a1、b1、c1和d1指向堆内存,我的数字代码具有以下核心循环。

const int n = 100000;

for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
    c1[j] += d1[j];
}

该循环通过另一个外部for循环执行10000次。为了加快速度,我将代码更改为:

for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
}

for (int j = 0; j < n; j++) {
    c1[j] += d1[j];
}

在Microsoft Visual C++10.0上编译,经过完全优化,并在Intel Core 2 Duo(x64)上启用了32位SSE2,第一个示例耗时5.5秒,双循环示例仅需1.9秒。

第一个循环的反汇编基本上是这样的(在整个程序中,这个块重复了大约五次):

movsd       xmm0,mmword ptr [edx+18h]
addsd       xmm0,mmword ptr [ecx+20h]
movsd       mmword ptr [ecx+20h],xmm0
movsd       xmm0,mmword ptr [esi+10h]
addsd       xmm0,mmword ptr [eax+30h]
movsd       mmword ptr [eax+30h],xmm0
movsd       xmm0,mmword ptr [edx+20h]
addsd       xmm0,mmword ptr [ecx+28h]
movsd       mmword ptr [ecx+28h],xmm0
movsd       xmm0,mmword ptr [esi+18h]
addsd       xmm0,mmword ptr [eax+38h]

双循环示例的每个循环都会生成此代码(以下块重复大约三次):

addsd       xmm0,mmword ptr [eax+28h]
movsd       mmword ptr [eax+28h],xmm0
movsd       xmm0,mmword ptr [ecx+20h]
addsd       xmm0,mmword ptr [eax+30h]
movsd       mmword ptr [eax+30h],xmm0
movsd       xmm0,mmword ptr [ecx+28h]
addsd       xmm0,mmword ptr [eax+38h]
movsd       mmword ptr [eax+38h],xmm0
movsd       xmm0,mmword ptr [ecx+30h]
addsd       xmm0,mmword ptr [eax+40h]
movsd       mmword ptr [eax+40h],xmm0

事实证明,这个问题无关紧要,因为行为严重依赖于数组(n)和CPU缓存的大小。因此,如果有进一步的兴趣,我会重新表述这个问题:

您能否深入了解导致不同缓存行为的细节,如下图中的五个区域所示?通过为这些CPU提供类似的图表,指出CPU/缓存架构之间的差异可能也很有趣。

这是完整的代码。它使用TBB Tick_Count进行更高分辨率的计时,可以通过不定义TBB_timing宏来禁用:

#include <iostream>
#include <iomanip>
#include <cmath>
#include <string>

//#define TBB_TIMING

#ifdef TBB_TIMING   
#include <tbb/tick_count.h>
using tbb::tick_count;
#else
#include <time.h>
#endif

using namespace std;

//#define preallocate_memory new_cont

enum { new_cont, new_sep };

double *a1, *b1, *c1, *d1;


void allo(int cont, int n)
{
    switch(cont) {
      case new_cont:
        a1 = new double[n*4];
        b1 = a1 + n;
        c1 = b1 + n;
        d1 = c1 + n;
        break;
      case new_sep:
        a1 = new double[n];
        b1 = new double[n];
        c1 = new double[n];
        d1 = new double[n];
        break;
    }

    for (int i = 0; i < n; i++) {
        a1[i] = 1.0;
        d1[i] = 1.0;
        c1[i] = 1.0;
        b1[i] = 1.0;
    }
}

void ff(int cont)
{
    switch(cont){
      case new_sep:
        delete[] b1;
        delete[] c1;
        delete[] d1;
      case new_cont:
        delete[] a1;
    }
}

double plain(int n, int m, int cont, int loops)
{
#ifndef preallocate_memory
    allo(cont,n);
#endif

#ifdef TBB_TIMING   
    tick_count t0 = tick_count::now();
#else
    clock_t start = clock();
#endif
        
    if (loops == 1) {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++){
                a1[j] += b1[j];
                c1[j] += d1[j];
            }
        }
    } else {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                a1[j] += b1[j];
            }
            for (int j = 0; j < n; j++) {
                c1[j] += d1[j];
            }
        }
    }
    double ret;

#ifdef TBB_TIMING   
    tick_count t1 = tick_count::now();
    ret = 2.0*double(n)*double(m)/(t1-t0).seconds();
#else
    clock_t end = clock();
    ret = 2.0*double(n)*double(m)/(double)(end - start) *double(CLOCKS_PER_SEC);
#endif
    
#ifndef preallocate_memory
    ff(cont);
#endif

    return ret;
}


void main()
{   
    freopen("C:\\test.csv", "w", stdout);

    char *s = " ";

    string na[2] ={"new_cont", "new_sep"};

    cout << "n";

    for (int j = 0; j < 2; j++)
        for (int i = 1; i <= 2; i++)
#ifdef preallocate_memory
            cout << s << i << "_loops_" << na[preallocate_memory];
#else
            cout << s << i << "_loops_" << na[j];
#endif
            
    cout << endl;

    long long nmax = 1000000;

#ifdef preallocate_memory
    allo(preallocate_memory, nmax);
#endif
    
    for (long long n = 1L; n < nmax; n = max(n+1, long long(n*1.2)))
    {
        const long long m = 10000000/n;
        cout << n;

        for (int j = 0; j < 2; j++)
            for (int i = 1; i <= 2; i++)
                cout << s << plain(n, m, j, i);
        cout << endl;
    }
}

它显示了n的不同值的FLOP/s。


当前回答

原始问题

为什么一个循环比两个循环慢得多?


结论:

案例1是一个经典的插值问题,它恰好是一个效率低下的问题。我还认为,这是许多机器架构和开发人员最终构建和设计具有多线程应用程序和并行编程能力的多核系统的主要原因之一。

从这种方法来看,不涉及硬件、操作系统和编译器如何协同工作来进行堆分配,包括使用RAM、缓存、页面文件等。;作为这些算法基础的数学向我们展示了这两个算法中哪一个是更好的解决方案。

我们可以将老板比喻为一个求和,它将代表一个必须在工人a和B之间传递的For循环。

我们可以很容易地看到,由于需要行驶的距离和工人之间所花费的时间的差异,情况2的速度至少是情况1的一半(如果不是稍多的话)。这种数学几乎与基准时间以及汇编指令中的差异数量完全吻合。


我现在将在下面开始解释所有这些是如何工作的。


评估问题

OP代码:

const int n=100000;

for(int j=0;j<n;j++){
    a1[j] += b1[j];
    c1[j] += d1[j];
}

And

for(int j=0;j<n;j++){
    a1[j] += b1[j];
}
for(int j=0;j<n;j++){
    c1[j] += d1[j];
}

对价

考虑到OP关于for循环的两个变体的原始问题,以及他对缓存行为的修正问题,以及许多其他优秀的答案和有用的评论;我想尝试在这里做一些不同的事情,对这种情况和问题采取不同的方法。


方法

考虑到这两个循环以及所有关于缓存和页面归档的讨论,我想从另一个角度来看待这一点。一种不涉及缓存和页面文件,也不涉及分配内存的执行的方法,实际上,这种方法甚至根本不涉及实际的硬件或软件。


透视图

在看了一段时间代码之后,问题是什么以及是什么导致了它变得非常明显。让我们将其分解为一个算法问题,并从使用数学符号的角度来看待它,然后将其类比于数学问题以及算法。


我们所知道的

我们知道这个循环将运行100000次。我们还知道a1、b1、c1和d1是64位体系结构上的指针。在32位机器上的C++中,所有指针都是4字节,而在64位机器上,它们的大小是8字节,因为指针的长度是固定的。

我们知道,在这两种情况下,我们都有32个字节可供分配。唯一的区别是我们在每次迭代中分配32个字节或两组2-8个字节,其中第二种情况是我们为两个独立循环的每次迭代分配16个字节。

这两个循环的总分配仍然等于32字节。有了这些信息,现在让我们继续展示这些概念的一般数学、算法和类比。

我们确实知道在这两种情况下必须执行同一组或一组操作的次数。我们确实知道在这两种情况下需要分配的内存量。我们可以评估两种情况下分配的总工作量大致相同。


What We Don't Know

We do not know how long it will take for each case unless if we set a counter and run a benchmark test. However, the benchmarks were already included from the original question and from some of the answers and comments as well; and we can see a significant difference between the two and this is the whole reasoning for this proposal to this problem.


Let's Investigate

It is already apparent that many have already done this by looking at the heap allocations, benchmark tests, looking at RAM, cache, and page files. Looking at specific data points and specific iteration indices were also included and the various conversations about this specific problem have many people starting to question other related things about it. How do we begin to look at this problem by using mathematical algorithms and applying an analogy to it? We start off by making a couple of assertions! Then we build out our algorithm from there.


Our Assertions:

  • We will let our loop and its iterations be a Summation that starts at 1 and ends at 100000 instead of starting with 0 as in the loops for we don't need to worry about the 0 indexing scheme of memory addressing since we are just interested in the algorithm itself.
  • In both cases we have four functions to work with and two function calls with two operations being done on each function call. We will set these up as functions and calls to functions as the following: F1(), F2(), f(a), f(b), f(c) and f(d).

The Algorithms:

1st Case: - Only one summation but two independent function calls.

Sum n=1 : [1,100000] = F1(), F2();
                       F1() = { f(a) = f(a) + f(b); }
                       F2() = { f(c) = f(c) + f(d); }

2nd Case: - Two summations but each has its own function call.

Sum1 n=1 : [1,100000] = F1();
                        F1() = { f(a) = f(a) + f(b); }

Sum2 n=1 : [1,100000] = F1();
                        F1() = { f(c) = f(c) + f(d); }

If you noticed F2() only exists in Sum from Case1 where F1() is contained in Sum from Case1 and in both Sum1 and Sum2 from Case2. This will be evident later on when we begin to conclude that there is an optimization that is happening within the second algorithm.

The iterations through the first case Sum calls f(a) that will add to its self f(b) then it calls f(c) that will do the same but add f(d) to itself for each 100000 iterations. In the second case, we have Sum1 and Sum2 that both act the same as if they were the same function being called twice in a row.

In this case we can treat Sum1 and Sum2 as just plain old Sum where Sum in this case looks like this: Sum n=1 : [1,100000] { f(a) = f(a) + f(b); } and now this looks like an optimization where we can just consider it to be the same function.


Summary with Analogy

With what we have seen in the second case it almost appears as if there is optimization since both for loops have the same exact signature, but this isn't the real issue. The issue isn't the work that is being done by f(a), f(b), f(c), and f(d). In both cases and the comparison between the two, it is the difference in the distance that the Summation has to travel in each case that gives you the difference in execution time.

Think of the for loops as being the summations that does the iterations as being a Boss that is giving orders to two people A & B and that their jobs are to meat C & D respectively and to pick up some package from them and return it. In this analogy, the for loops or summation iterations and condition checks themselves don't actually represent the Boss. What actually represents the Boss is not from the actual mathematical algorithms directly but from the actual concept of Scope and Code Block within a routine or subroutine, method, function, translation unit, etc. The first algorithm has one scope where the second algorithm has two consecutive scopes.

Within the first case on each call slip, the Boss goes to A and gives the order and A goes off to fetch B's package then the Boss goes to C and gives the orders to do the same and receive the package from D on each iteration.

Within the second case, the Boss works directly with A to go and fetch B's package until all packages are received. Then the Boss works with C to do the same for getting all of D's packages.

Since we are working with an 8-byte pointer and dealing with heap allocation let's consider the following problem. Let's say that the Boss is 100 feet from A and that A is 500 feet from C. We don't need to worry about how far the Boss is initially from C because of the order of executions. In both cases, the Boss initially travels from A first then to B. This analogy isn't to say that this distance is exact; it is just a useful test case scenario to show the workings of the algorithms.

In many cases when doing heap allocations and working with the cache and page files, these distances between address locations may not vary that much or they can vary significantly depending on the nature of the data types and the array sizes.


The Test Cases:

First Case: On first iteration the Boss has to initially go 100 feet to give the order slip to A and A goes off and does his thing, but then the Boss has to travel 500 feet to C to give him his order slip. Then on the next iteration and every other iteration after the Boss has to go back and forth 500 feet between the two.

Second Case: The Boss has to travel 100 feet on the first iteration to A, but after that, he is already there and just waits for A to get back until all slips are filled. Then the Boss has to travel 500 feet on the first iteration to C because C is 500 feet from A. Since this Boss( Summation, For Loop ) is being called right after working with A he then just waits there as he did with A until all of C's order slips are done.


The Difference In Distances Traveled

const n = 100000
distTraveledOfFirst = (100 + 500) + ((n-1)*(500 + 500));
// Simplify
distTraveledOfFirst = 600 + (99999*1000);
distTraveledOfFirst = 600 + 99999000;
distTraveledOfFirst = 99999600
// Distance Traveled On First Algorithm = 99,999,600ft

distTraveledOfSecond = 100 + 500 = 600;
// Distance Traveled On Second Algorithm = 600ft;

The Comparison of Arbitrary Values

We can easily see that 600 is far less than approximately 100 million. Now, this isn't exact, because we don't know the actual difference in distance between which address of RAM or from which cache or page file each call on each iteration is going to be due to many other unseen variables. This is just an assessment of the situation to be aware of and looking at it from the worst-case scenario.

From these numbers it would almost appear as if algorithm one should be 99% slower than algorithm two; however, this is only the Boss's part or responsibility of the algorithms and it doesn't account for the actual workers A, B, C, & D and what they have to do on each and every iteration of the Loop. So the boss's job only accounts for about 15 - 40% of the total work being done. The bulk of the work that is done through the workers has a slightly bigger impact towards keeping the ratio of the speed rate differences to about 50-70%


The Observation: - The differences between the two algorithms

In this situation, it is the structure of the process of the work being done. It goes to show that Case 2 is more efficient from both the partial optimization of having a similar function declaration and definition where it is only the variables that differ by name and the distance traveled.

We also see that the total distance traveled in Case 1 is much farther than it is in Case 2 and we can consider this distance traveled our Time Factor between the two algorithms. Case 1 has considerable more work to do than Case 2 does.

This is observable from the evidence of the assembly instructions that were shown in both cases. Along with what was already stated about these cases, this doesn't account for the fact that in Case 1 the boss will have to wait for both A & C to get back before he can go back to A again for each iteration. It also doesn't account for the fact that if A or B is taking an extremely long time then both the Boss and the other worker(s) are idle waiting to be executed.

In Case 2 the only one being idle is the Boss until the worker gets back. So even this has an impact on the algorithm.



The OP's Amended Question(s)

EDIT: The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:

Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?

It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.


Regarding These Questions

As I have demonstrated without a doubt, there is an underlying issue even before the Hardware and Software becomes involved.

Now as for the management of memory and caching along with page files, etc. which all work together in an integrated set of systems between the following:

  • The architecture (hardware, firmware, some embedded drivers, kernels and assembly instruction sets).
  • The OS (file and memory management systems, drivers and the registry).
  • The compiler (translation units and optimizations of the source code).
  • And even the source code itself with its set(s) of distinctive algorithms.

We can already see that there is a bottleneck that is happening within the first algorithm before we even apply it to any machine with any arbitrary architecture, OS, and programmable language compared to the second algorithm. There already existed a problem before involving the intrinsics of a modern computer.


The Ending Results

However; it is not to say that these new questions are not of importance because they themselves are and they do play a role after all. They do impact the procedures and the overall performance and that is evident with the various graphs and assessments from many who have given their answer(s) and or comment(s).

If you paid attention to the analogy of the Boss and the two workers A & B who had to go and retrieve packages from C & D respectively and considering the mathematical notations of the two algorithms in question; you can see without the involvement of the computer hardware and software Case 2 is approximately 60% faster than Case 1.

When you look at the graphs and charts after these algorithms have been applied to some source code, compiled, optimized, and executed through the OS to perform their operations on a given piece of hardware, you can even see a little more degradation between the differences in these algorithms.

If the Data set is fairly small it may not seem all that bad of a difference at first. However, since Case 1 is about 60 - 70% slower than Case 2 we can look at the growth of this function in terms of the differences in time executions:

DeltaTimeDifference approximately = Loop1(time) - Loop2(time)
//where
Loop1(time) = Loop2(time) + (Loop2(time)*[0.6,0.7]) // approximately
// So when we substitute this back into the difference equation we end up with
DeltaTimeDifference approximately = (Loop2(time) + (Loop2(time)*[0.6,0.7])) - Loop2(time)
// And finally we can simplify this to
DeltaTimeDifference approximately = [0.6,0.7]*Loop2(time)

This approximation is the average difference between these two loops both algorithmically and machine operations involving software optimizations and machine instructions.

When the data set grows linearly, so does the difference in time between the two. Algorithm 1 has more fetches than algorithm 2 which is evident when the Boss has to travel back and forth the maximum distance between A & C for every iteration after the first iteration while algorithm 2 the Boss has to travel to A once and then after being done with A he has to travel a maximum distance only one time when going from A to C.

Trying to have the Boss focusing on doing two similar things at once and juggling them back and forth instead of focusing on similar consecutive tasks is going to make him quite angry by the end of the day since he had to travel and work twice as much. Therefore do not lose the scope of the situation by letting your boss getting into an interpolated bottleneck because the boss's spouse and children wouldn't appreciate it.



Amendment: Software Engineering Design Principles

-- The difference between local Stack and heap allocated computations within iterative for loops and the difference between their usages, their efficiencies, and effectiveness --

The mathematical algorithm that I proposed above mainly applies to loops that perform operations on data that is allocated on the heap.

  • Consecutive Stack Operations:
    • If the loops are performing operations on data locally within a single code block or scope that is within the stack frame it will still sort of apply, but the memory locations are much closer where they are typically sequential and the difference in distance traveled or execution time is almost negligible. Since there are no allocations being done within the heap, the memory isn't scattered, and the memory isn't being fetched through ram. The memory is typically sequential and relative to the stack frame and stack pointer.
  • When consecutive operations are being done on the stack, a modern processor will cache repetitive values and addresses keeping these values within local cache registers. The time of operations or instructions here is on the order of nano-seconds.
  • Consecutive Heap Allocated Operations:
    • When you begin to apply heap allocations and the processor has to fetch the memory addresses on consecutive calls, depending on the architecture of the CPU, the bus controller, and the RAM modules the time of operations or execution can be on the order of micro to milliseconds. In comparison to cached stack operations, these are quite slow.
    • The CPU will have to fetch the memory address from RAM and typically anything across the system bus is slow compared to the internal data paths or data buses within the CPU itself.

So when you are working with data that needs to be on the heap and you are traversing through them in loops, it is more efficient to keep each data set and its corresponding algorithms within its own single loop. You will get better optimizations compared to trying to factor out consecutive loops by putting multiple operations of different data sets that are on the heap into a single loop.

It is okay to do this with data that is on the stack since they are frequently cached, but not for data that has to have its memory address queried every iteration.

This is where software engineering and software architecture design comes into play. It is the ability to know how to organize your data, knowing when to cache your data, knowing when to allocate your data on the heap, knowing how to design and implement your algorithms, and knowing when and where to call them.

You might have the same algorithm that pertains to the same data set, but you might want one implementation design for its stack variant and another for its heap-allocated variant just because of the above issue that is seen from its O(n) complexity of the algorithm when working with the heap.

From what I've noticed over the years, many people do not take this fact into consideration. They will tend to design one algorithm that works on a particular data set and they will use it regardless of the data set being locally cached on the stack or if it was allocated on the heap.

If you want true optimization, yes it might seem like code duplication, but to generalize it would be more efficient to have two variants of the same algorithm. One for stack operations, and the other for heap operations that are performed in iterative loops!

Here's a pseudo example: Two simple structs, one algorithm.

struct A {
    int data;
    A() : data{0}{}
    A(int a) : data{a}{}
};
struct B {
    int data;
    B() : data{0}{}
    A(int b) : data{b}{}
}

template<typename T>
void Foo( T& t ) {
    // Do something with t
}

// Some looping operation: first stack then heap.

// Stack data:
A dataSetA[10] = {};
B dataSetB[10] = {};

// For stack operations this is okay and efficient
for (int i = 0; i < 10; i++ ) {
   Foo(dataSetA[i]);
   Foo(dataSetB[i]);
}

// If the above two were on the heap then performing
// the same algorithm to both within the same loop
// will create that bottleneck
A* dataSetA = new [] A();
B* dataSetB = new [] B();
for ( int i = 0; i < 10; i++ ) {
    Foo(dataSetA[i]); // dataSetA is on the heap here
    Foo(dataSetB[i]); // dataSetB is on the heap here
} // this will be inefficient.

// To improve the efficiency above, put them into separate loops...

for (int i = 0; i < 10; i++ ) {
    Foo(dataSetA[i]);
}
for (int i = 0; i < 10; i++ ) {
    Foo(dataSetB[i]);
}
// This will be much more efficient than above.
// The code isn't perfect syntax, it's only pseudo code
// to illustrate a point.

This is what I was referring to by having separate implementations for stack variants versus heap variants. The algorithms themselves don't matter too much, it's the looping structures that you will use them in that do.

其他回答

第一个循环交替写入每个变量。第二个和第三个仅对元素大小进行小的跳跃。

试着用20厘米的纸和笔写两条平行的20个十字线。试着先完成一行,然后再完成另一行,试着在每一行中交替写一个十字。

这是因为CPU没有太多缓存未命中(它必须等待来自RAM芯片的阵列数据)。您可以不断调整数组的大小,使其超过CPU的一级缓存(L1)和二级缓存(L2)的大小,并根据数组的大小绘制代码执行所需的时间。图表不应该像你期望的那样是一条直线。

进一步分析后,我认为这(至少部分)是由四个指针的数据对齐造成的。这将导致一定程度的缓存库/通道冲突。

如果我猜对了如何分配数组,那么它们很可能与页面行对齐。

这意味着每个循环中的所有访问都将落在相同的缓存路径上。然而,英特尔处理器已经有一段时间具有8路L1缓存关联性。但事实上,表现并不完全一致。访问4路仍然比访问2路慢。

编辑:事实上,它看起来像是在单独分配所有数组。通常,当请求如此大的分配时,分配器将从OS请求新的页面。因此,大的分配很有可能出现在距页面边界相同的偏移处。

以下是测试代码:

int main(){
    const int n = 100000;

#ifdef ALLOCATE_SEPERATE
    double *a1 = (double*)malloc(n * sizeof(double));
    double *b1 = (double*)malloc(n * sizeof(double));
    double *c1 = (double*)malloc(n * sizeof(double));
    double *d1 = (double*)malloc(n * sizeof(double));
#else
    double *a1 = (double*)malloc(n * sizeof(double) * 4);
    double *b1 = a1 + n;
    double *c1 = b1 + n;
    double *d1 = c1 + n;
#endif

    //  Zero the data to prevent any chance of denormals.
    memset(a1,0,n * sizeof(double));
    memset(b1,0,n * sizeof(double));
    memset(c1,0,n * sizeof(double));
    memset(d1,0,n * sizeof(double));

    //  Print the addresses
    cout << a1 << endl;
    cout << b1 << endl;
    cout << c1 << endl;
    cout << d1 << endl;

    clock_t start = clock();

    int c = 0;
    while (c++ < 10000){

#if ONE_LOOP
        for(int j=0;j<n;j++){
            a1[j] += b1[j];
            c1[j] += d1[j];
        }
#else
        for(int j=0;j<n;j++){
            a1[j] += b1[j];
        }
        for(int j=0;j<n;j++){
            c1[j] += d1[j];
        }
#endif

    }

    clock_t end = clock();
    cout << "seconds = " << (double)(end - start) / CLOCKS_PER_SEC << endl;

    system("pause");
    return 0;
}

基准结果:

编辑:实际Core 2体系结构机器上的结果:

2 x Intel Xeon X5482 Harpertown@3.2 GHz:

#define ALLOCATE_SEPERATE
#define ONE_LOOP
00600020
006D0020
007A0020
00870020
seconds = 6.206

#define ALLOCATE_SEPERATE
//#define ONE_LOOP
005E0020
006B0020
00780020
00850020
seconds = 2.116

//#define ALLOCATE_SEPERATE
#define ONE_LOOP
00570020
00633520
006F6A20
007B9F20
seconds = 1.894

//#define ALLOCATE_SEPERATE
//#define ONE_LOOP
008C0020
00983520
00A46A20
00B09F20
seconds = 1.993

观察:

一圈6.206秒,两圈2.116秒。这准确地再现了OP的结果。在前两个测试中,阵列是单独分配的。您将注意到它们都具有与页面相同的对齐方式。在第二个测试中,阵列被打包在一起以打破这种对齐。在这里,您会注意到两个循环都更快。此外,第二个(双)循环现在是您通常预期的速度较慢的循环。

正如@Stephen Cannon在评论中指出的那样,这种对齐很可能会导致加载/存储单元或缓存中出现假别名。我在谷歌上搜索了一下,发现英特尔实际上有一个用于部分地址混淆暂停的硬件计数器:

http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/~amplifierxe/pmw_dp/events/partial_address_alias.html


5个地区-解释

区域1:

这个很简单。数据集如此之小,以至于性能受到开销(如循环和分支)的控制。

区域2:

在这里,随着数据大小的增加,相对开销的数量下降,性能“饱和”。这里两个循环比较慢,因为它有两倍的循环和分支开销。

我不知道这里到底发生了什么。。。当Agner Fog提到缓存库冲突时,对齐仍可能发挥作用。(这个链接是关于Sandy Bridge的,但这个想法应该仍然适用于核心2。)

区域3:

此时,数据不再适合一级缓存。因此,性能受到L1<->L2缓存带宽的限制。

区域4:

我们观察到的是单循环中的性能下降。如上所述,这是由于对齐(最有可能)导致处理器加载/存储单元中的假混叠暂停。

然而,为了发生假混叠,数据集之间必须有足够大的步长。这就是为什么你在区域3中看不到这一点。

区域5:

此时,缓存中没有任何内容。所以你受到内存带宽的限制。


第二个循环涉及的缓存活动更少,因此处理器更容易跟上内存需求。

原始问题

为什么一个循环比两个循环慢得多?


结论:

案例1是一个经典的插值问题,它恰好是一个效率低下的问题。我还认为,这是许多机器架构和开发人员最终构建和设计具有多线程应用程序和并行编程能力的多核系统的主要原因之一。

从这种方法来看,不涉及硬件、操作系统和编译器如何协同工作来进行堆分配,包括使用RAM、缓存、页面文件等。;作为这些算法基础的数学向我们展示了这两个算法中哪一个是更好的解决方案。

我们可以将老板比喻为一个求和,它将代表一个必须在工人a和B之间传递的For循环。

我们可以很容易地看到,由于需要行驶的距离和工人之间所花费的时间的差异,情况2的速度至少是情况1的一半(如果不是稍多的话)。这种数学几乎与基准时间以及汇编指令中的差异数量完全吻合。


我现在将在下面开始解释所有这些是如何工作的。


评估问题

OP代码:

const int n=100000;

for(int j=0;j<n;j++){
    a1[j] += b1[j];
    c1[j] += d1[j];
}

And

for(int j=0;j<n;j++){
    a1[j] += b1[j];
}
for(int j=0;j<n;j++){
    c1[j] += d1[j];
}

对价

考虑到OP关于for循环的两个变体的原始问题,以及他对缓存行为的修正问题,以及许多其他优秀的答案和有用的评论;我想尝试在这里做一些不同的事情,对这种情况和问题采取不同的方法。


方法

考虑到这两个循环以及所有关于缓存和页面归档的讨论,我想从另一个角度来看待这一点。一种不涉及缓存和页面文件,也不涉及分配内存的执行的方法,实际上,这种方法甚至根本不涉及实际的硬件或软件。


透视图

在看了一段时间代码之后,问题是什么以及是什么导致了它变得非常明显。让我们将其分解为一个算法问题,并从使用数学符号的角度来看待它,然后将其类比于数学问题以及算法。


我们所知道的

我们知道这个循环将运行100000次。我们还知道a1、b1、c1和d1是64位体系结构上的指针。在32位机器上的C++中,所有指针都是4字节,而在64位机器上,它们的大小是8字节,因为指针的长度是固定的。

我们知道,在这两种情况下,我们都有32个字节可供分配。唯一的区别是我们在每次迭代中分配32个字节或两组2-8个字节,其中第二种情况是我们为两个独立循环的每次迭代分配16个字节。

这两个循环的总分配仍然等于32字节。有了这些信息,现在让我们继续展示这些概念的一般数学、算法和类比。

我们确实知道在这两种情况下必须执行同一组或一组操作的次数。我们确实知道在这两种情况下需要分配的内存量。我们可以评估两种情况下分配的总工作量大致相同。


What We Don't Know

We do not know how long it will take for each case unless if we set a counter and run a benchmark test. However, the benchmarks were already included from the original question and from some of the answers and comments as well; and we can see a significant difference between the two and this is the whole reasoning for this proposal to this problem.


Let's Investigate

It is already apparent that many have already done this by looking at the heap allocations, benchmark tests, looking at RAM, cache, and page files. Looking at specific data points and specific iteration indices were also included and the various conversations about this specific problem have many people starting to question other related things about it. How do we begin to look at this problem by using mathematical algorithms and applying an analogy to it? We start off by making a couple of assertions! Then we build out our algorithm from there.


Our Assertions:

  • We will let our loop and its iterations be a Summation that starts at 1 and ends at 100000 instead of starting with 0 as in the loops for we don't need to worry about the 0 indexing scheme of memory addressing since we are just interested in the algorithm itself.
  • In both cases we have four functions to work with and two function calls with two operations being done on each function call. We will set these up as functions and calls to functions as the following: F1(), F2(), f(a), f(b), f(c) and f(d).

The Algorithms:

1st Case: - Only one summation but two independent function calls.

Sum n=1 : [1,100000] = F1(), F2();
                       F1() = { f(a) = f(a) + f(b); }
                       F2() = { f(c) = f(c) + f(d); }

2nd Case: - Two summations but each has its own function call.

Sum1 n=1 : [1,100000] = F1();
                        F1() = { f(a) = f(a) + f(b); }

Sum2 n=1 : [1,100000] = F1();
                        F1() = { f(c) = f(c) + f(d); }

If you noticed F2() only exists in Sum from Case1 where F1() is contained in Sum from Case1 and in both Sum1 and Sum2 from Case2. This will be evident later on when we begin to conclude that there is an optimization that is happening within the second algorithm.

The iterations through the first case Sum calls f(a) that will add to its self f(b) then it calls f(c) that will do the same but add f(d) to itself for each 100000 iterations. In the second case, we have Sum1 and Sum2 that both act the same as if they were the same function being called twice in a row.

In this case we can treat Sum1 and Sum2 as just plain old Sum where Sum in this case looks like this: Sum n=1 : [1,100000] { f(a) = f(a) + f(b); } and now this looks like an optimization where we can just consider it to be the same function.


Summary with Analogy

With what we have seen in the second case it almost appears as if there is optimization since both for loops have the same exact signature, but this isn't the real issue. The issue isn't the work that is being done by f(a), f(b), f(c), and f(d). In both cases and the comparison between the two, it is the difference in the distance that the Summation has to travel in each case that gives you the difference in execution time.

Think of the for loops as being the summations that does the iterations as being a Boss that is giving orders to two people A & B and that their jobs are to meat C & D respectively and to pick up some package from them and return it. In this analogy, the for loops or summation iterations and condition checks themselves don't actually represent the Boss. What actually represents the Boss is not from the actual mathematical algorithms directly but from the actual concept of Scope and Code Block within a routine or subroutine, method, function, translation unit, etc. The first algorithm has one scope where the second algorithm has two consecutive scopes.

Within the first case on each call slip, the Boss goes to A and gives the order and A goes off to fetch B's package then the Boss goes to C and gives the orders to do the same and receive the package from D on each iteration.

Within the second case, the Boss works directly with A to go and fetch B's package until all packages are received. Then the Boss works with C to do the same for getting all of D's packages.

Since we are working with an 8-byte pointer and dealing with heap allocation let's consider the following problem. Let's say that the Boss is 100 feet from A and that A is 500 feet from C. We don't need to worry about how far the Boss is initially from C because of the order of executions. In both cases, the Boss initially travels from A first then to B. This analogy isn't to say that this distance is exact; it is just a useful test case scenario to show the workings of the algorithms.

In many cases when doing heap allocations and working with the cache and page files, these distances between address locations may not vary that much or they can vary significantly depending on the nature of the data types and the array sizes.


The Test Cases:

First Case: On first iteration the Boss has to initially go 100 feet to give the order slip to A and A goes off and does his thing, but then the Boss has to travel 500 feet to C to give him his order slip. Then on the next iteration and every other iteration after the Boss has to go back and forth 500 feet between the two.

Second Case: The Boss has to travel 100 feet on the first iteration to A, but after that, he is already there and just waits for A to get back until all slips are filled. Then the Boss has to travel 500 feet on the first iteration to C because C is 500 feet from A. Since this Boss( Summation, For Loop ) is being called right after working with A he then just waits there as he did with A until all of C's order slips are done.


The Difference In Distances Traveled

const n = 100000
distTraveledOfFirst = (100 + 500) + ((n-1)*(500 + 500));
// Simplify
distTraveledOfFirst = 600 + (99999*1000);
distTraveledOfFirst = 600 + 99999000;
distTraveledOfFirst = 99999600
// Distance Traveled On First Algorithm = 99,999,600ft

distTraveledOfSecond = 100 + 500 = 600;
// Distance Traveled On Second Algorithm = 600ft;

The Comparison of Arbitrary Values

We can easily see that 600 is far less than approximately 100 million. Now, this isn't exact, because we don't know the actual difference in distance between which address of RAM or from which cache or page file each call on each iteration is going to be due to many other unseen variables. This is just an assessment of the situation to be aware of and looking at it from the worst-case scenario.

From these numbers it would almost appear as if algorithm one should be 99% slower than algorithm two; however, this is only the Boss's part or responsibility of the algorithms and it doesn't account for the actual workers A, B, C, & D and what they have to do on each and every iteration of the Loop. So the boss's job only accounts for about 15 - 40% of the total work being done. The bulk of the work that is done through the workers has a slightly bigger impact towards keeping the ratio of the speed rate differences to about 50-70%


The Observation: - The differences between the two algorithms

In this situation, it is the structure of the process of the work being done. It goes to show that Case 2 is more efficient from both the partial optimization of having a similar function declaration and definition where it is only the variables that differ by name and the distance traveled.

We also see that the total distance traveled in Case 1 is much farther than it is in Case 2 and we can consider this distance traveled our Time Factor between the two algorithms. Case 1 has considerable more work to do than Case 2 does.

This is observable from the evidence of the assembly instructions that were shown in both cases. Along with what was already stated about these cases, this doesn't account for the fact that in Case 1 the boss will have to wait for both A & C to get back before he can go back to A again for each iteration. It also doesn't account for the fact that if A or B is taking an extremely long time then both the Boss and the other worker(s) are idle waiting to be executed.

In Case 2 the only one being idle is the Boss until the worker gets back. So even this has an impact on the algorithm.



The OP's Amended Question(s)

EDIT: The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:

Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?

It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.


Regarding These Questions

As I have demonstrated without a doubt, there is an underlying issue even before the Hardware and Software becomes involved.

Now as for the management of memory and caching along with page files, etc. which all work together in an integrated set of systems between the following:

  • The architecture (hardware, firmware, some embedded drivers, kernels and assembly instruction sets).
  • The OS (file and memory management systems, drivers and the registry).
  • The compiler (translation units and optimizations of the source code).
  • And even the source code itself with its set(s) of distinctive algorithms.

We can already see that there is a bottleneck that is happening within the first algorithm before we even apply it to any machine with any arbitrary architecture, OS, and programmable language compared to the second algorithm. There already existed a problem before involving the intrinsics of a modern computer.


The Ending Results

However; it is not to say that these new questions are not of importance because they themselves are and they do play a role after all. They do impact the procedures and the overall performance and that is evident with the various graphs and assessments from many who have given their answer(s) and or comment(s).

If you paid attention to the analogy of the Boss and the two workers A & B who had to go and retrieve packages from C & D respectively and considering the mathematical notations of the two algorithms in question; you can see without the involvement of the computer hardware and software Case 2 is approximately 60% faster than Case 1.

When you look at the graphs and charts after these algorithms have been applied to some source code, compiled, optimized, and executed through the OS to perform their operations on a given piece of hardware, you can even see a little more degradation between the differences in these algorithms.

If the Data set is fairly small it may not seem all that bad of a difference at first. However, since Case 1 is about 60 - 70% slower than Case 2 we can look at the growth of this function in terms of the differences in time executions:

DeltaTimeDifference approximately = Loop1(time) - Loop2(time)
//where
Loop1(time) = Loop2(time) + (Loop2(time)*[0.6,0.7]) // approximately
// So when we substitute this back into the difference equation we end up with
DeltaTimeDifference approximately = (Loop2(time) + (Loop2(time)*[0.6,0.7])) - Loop2(time)
// And finally we can simplify this to
DeltaTimeDifference approximately = [0.6,0.7]*Loop2(time)

This approximation is the average difference between these two loops both algorithmically and machine operations involving software optimizations and machine instructions.

When the data set grows linearly, so does the difference in time between the two. Algorithm 1 has more fetches than algorithm 2 which is evident when the Boss has to travel back and forth the maximum distance between A & C for every iteration after the first iteration while algorithm 2 the Boss has to travel to A once and then after being done with A he has to travel a maximum distance only one time when going from A to C.

Trying to have the Boss focusing on doing two similar things at once and juggling them back and forth instead of focusing on similar consecutive tasks is going to make him quite angry by the end of the day since he had to travel and work twice as much. Therefore do not lose the scope of the situation by letting your boss getting into an interpolated bottleneck because the boss's spouse and children wouldn't appreciate it.



Amendment: Software Engineering Design Principles

-- The difference between local Stack and heap allocated computations within iterative for loops and the difference between their usages, their efficiencies, and effectiveness --

The mathematical algorithm that I proposed above mainly applies to loops that perform operations on data that is allocated on the heap.

  • Consecutive Stack Operations:
    • If the loops are performing operations on data locally within a single code block or scope that is within the stack frame it will still sort of apply, but the memory locations are much closer where they are typically sequential and the difference in distance traveled or execution time is almost negligible. Since there are no allocations being done within the heap, the memory isn't scattered, and the memory isn't being fetched through ram. The memory is typically sequential and relative to the stack frame and stack pointer.
  • When consecutive operations are being done on the stack, a modern processor will cache repetitive values and addresses keeping these values within local cache registers. The time of operations or instructions here is on the order of nano-seconds.
  • Consecutive Heap Allocated Operations:
    • When you begin to apply heap allocations and the processor has to fetch the memory addresses on consecutive calls, depending on the architecture of the CPU, the bus controller, and the RAM modules the time of operations or execution can be on the order of micro to milliseconds. In comparison to cached stack operations, these are quite slow.
    • The CPU will have to fetch the memory address from RAM and typically anything across the system bus is slow compared to the internal data paths or data buses within the CPU itself.

So when you are working with data that needs to be on the heap and you are traversing through them in loops, it is more efficient to keep each data set and its corresponding algorithms within its own single loop. You will get better optimizations compared to trying to factor out consecutive loops by putting multiple operations of different data sets that are on the heap into a single loop.

It is okay to do this with data that is on the stack since they are frequently cached, but not for data that has to have its memory address queried every iteration.

This is where software engineering and software architecture design comes into play. It is the ability to know how to organize your data, knowing when to cache your data, knowing when to allocate your data on the heap, knowing how to design and implement your algorithms, and knowing when and where to call them.

You might have the same algorithm that pertains to the same data set, but you might want one implementation design for its stack variant and another for its heap-allocated variant just because of the above issue that is seen from its O(n) complexity of the algorithm when working with the heap.

From what I've noticed over the years, many people do not take this fact into consideration. They will tend to design one algorithm that works on a particular data set and they will use it regardless of the data set being locally cached on the stack or if it was allocated on the heap.

If you want true optimization, yes it might seem like code duplication, but to generalize it would be more efficient to have two variants of the same algorithm. One for stack operations, and the other for heap operations that are performed in iterative loops!

Here's a pseudo example: Two simple structs, one algorithm.

struct A {
    int data;
    A() : data{0}{}
    A(int a) : data{a}{}
};
struct B {
    int data;
    B() : data{0}{}
    A(int b) : data{b}{}
}

template<typename T>
void Foo( T& t ) {
    // Do something with t
}

// Some looping operation: first stack then heap.

// Stack data:
A dataSetA[10] = {};
B dataSetB[10] = {};

// For stack operations this is okay and efficient
for (int i = 0; i < 10; i++ ) {
   Foo(dataSetA[i]);
   Foo(dataSetB[i]);
}

// If the above two were on the heap then performing
// the same algorithm to both within the same loop
// will create that bottleneck
A* dataSetA = new [] A();
B* dataSetB = new [] B();
for ( int i = 0; i < 10; i++ ) {
    Foo(dataSetA[i]); // dataSetA is on the heap here
    Foo(dataSetB[i]); // dataSetB is on the heap here
} // this will be inefficient.

// To improve the efficiency above, put them into separate loops...

for (int i = 0; i < 10; i++ ) {
    Foo(dataSetA[i]);
}
for (int i = 0; i < 10; i++ ) {
    Foo(dataSetB[i]);
}
// This will be much more efficient than above.
// The code isn't perfect syntax, it's only pseudo code
// to illustrate a point.

This is what I was referring to by having separate implementations for stack variants versus heap variants. The algorithms themselves don't matter too much, it's the looping structures that you will use them in that do.