在回答另一个Stack Overflow问题时,我偶然发现了一个有趣的子问题。对6个整数的数组进行排序的最快方法是什么?

因为问题层次很低:

我们不能假设库是可用的(而且调用本身也有开销),只有纯C 为了避免清空指令管道(这有非常高的成本),我们可能应该最小化分支、跳转和其他类型的控制流中断(比如隐藏在&&或||序列点后面的那些)。 空间是有限的,最小化寄存器和内存的使用是一个问题,理想情况下,就地排序可能是最好的。

实际上,这个问题是一种Golf,其目标不是最小化源长度,而是最小化执行时间。我称之为“Zening”代码,就像Michael Abrash在《Zen of code optimization》一书及其续集中所使用的那样。

至于为什么它有趣,有几个层面:

示例简单,易于理解和测量,不需要太多的C技能 它显示了对问题选择好的算法的影响,也显示了编译器和底层硬件的影响。

下面是我的参考(简单的,不是优化的)实现和测试集。

#include <stdio.h>

static __inline__ int sort6(int * d){

    char j, i, imin;
    int tmp;
    for (j = 0 ; j < 5 ; j++){
        imin = j;
        for (i = j + 1; i < 6 ; i++){
            if (d[i] < d[imin]){
                imin = i;
            }
        }
        tmp = d[j];
        d[j] = d[imin];
        d[imin] = tmp;
    }
}

static __inline__ unsigned long long rdtsc(void)
{
  unsigned long long int x;
     __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
     return x;
}

int main(int argc, char ** argv){
    int i;
    int d[6][5] = {
        {1, 2, 3, 4, 5, 6},
        {6, 5, 4, 3, 2, 1},
        {100, 2, 300, 4, 500, 6},
        {100, 2, 3, 4, 500, 6},
        {1, 200, 3, 4, 5, 600},
        {1, 1, 2, 1, 2, 1}
    };

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6 ; i++){
        sort6(d[i]);
        /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
         *  d[i][0], d[i][6], d[i][7],
         *  d[i][8], d[i][9], d[i][10]);
        */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %d\n", (unsigned)cycles);
}

生的结果

随着变体的数量越来越多,我将它们都收集到一个测试套件中,可以在这里找到。在Kevin Stock的帮助下,实际使用的测试没有上面展示的那么简单。您可以在自己的环境中编译和执行它。我对不同目标架构/编译器上的行为很感兴趣。(好了,伙计们,把它放在答案里,我将+1一个新结果集的每个贡献者)。

一年前,我把答案给了Daniel Stutzbach(高尔夫),因为他是当时最快的解决方案(排序网络)的来源。

Linux 64位,gcc 4.6.1 64位,Intel Core 2 Duo E8400, -O2

Direct call to qsort library function : 689.38 Naive implementation (insertion sort) : 285.70 Insertion Sort (Daniel Stutzbach) : 142.12 Insertion Sort Unrolled : 125.47 Rank Order : 102.26 Rank Order with registers : 58.03 Sorting Networks (Daniel Stutzbach) : 111.68 Sorting Networks (Paul R) : 66.36 Sorting Networks 12 with Fast Swap : 58.86 Sorting Networks 12 reordered Swap : 53.74 Sorting Networks 12 reordered Simple Swap : 31.54 Reordered Sorting Network w/ fast swap : 31.54 Reordered Sorting Network w/ fast swap V2 : 33.63 Inlined Bubble Sort (Paolo Bonzini) : 48.85 Unrolled Insertion Sort (Paolo Bonzini) : 75.30

Linux 64位,gcc 4.6.1 64位,Intel Core 2 Duo E8400, -O1

Direct call to qsort library function : 705.93 Naive implementation (insertion sort) : 135.60 Insertion Sort (Daniel Stutzbach) : 142.11 Insertion Sort Unrolled : 126.75 Rank Order : 46.42 Rank Order with registers : 43.58 Sorting Networks (Daniel Stutzbach) : 115.57 Sorting Networks (Paul R) : 64.44 Sorting Networks 12 with Fast Swap : 61.98 Sorting Networks 12 reordered Swap : 54.67 Sorting Networks 12 reordered Simple Swap : 31.54 Reordered Sorting Network w/ fast swap : 31.24 Reordered Sorting Network w/ fast swap V2 : 33.07 Inlined Bubble Sort (Paolo Bonzini) : 45.79 Unrolled Insertion Sort (Paolo Bonzini) : 80.15

我包括了-O1和-O2的结果,因为令人惊讶的是,在一些程序中,O2的效率低于O1。我想知道什么具体的优化有这种效果?

对建议解决方案的评论

插入排序(丹尼尔·斯图茨巴赫)

正如预期的那样,最小化分支确实是一个好主意。

排序网络(丹尼尔·斯图茨巴赫)

比插入排序好。我想知道主要的效果是不是避免外部循环。我试着通过展开插入排序来检查,确实我们得到了大致相同的数字(代码在这里)。

排序网络(保罗R)

迄今为止最好的。我用来测试的实际代码在这里。目前还不知道为什么它的速度几乎是其他排序网络实现的两倍。参数传递?快速max ?

排序网络12 SWAP与快速交换

根据Daniel Stutzbach的建议,我将他的12交换排序网络与无分支快速交换相结合(代码在这里)。它确实更快,到目前为止最好的,只有很小的利润率(大约5%),因为可以使用更少的交换。

同样有趣的是,无分支交换似乎比在PPC架构上使用if的简单交换效率低得多(4倍)。

调用库qsort

To give another reference point I also tried as suggested to just call library qsort (code is here). As expected it is much slower : 10 to 30 times slower... as it became obvious with the new test suite, the main problem seems to be the initial load of the library after the first call, and it compares not so poorly with other version. It is just between 3 and 20 times slower on my Linux. On some architecture used for tests by others it seems even to be faster (I'm really surprised by that one, as library qsort use a more complex API).

等级次序

Rex Kerr proposed another completely different method : for each item of the array compute directly its final position. This is efficient because computing rank order do not need branch. The drawback of this method is that it takes three times the amount of memory of the array (one copy of array and variables to store rank orders). The performance results are very surprising (and interesting). On my reference architecture with 32 bits OS and Intel Core2 Quad E8300, cycle count was slightly below 1000 (like sorting networks with branching swap). But when compiled and executed on my 64 bits box (Intel Core2 Duo) it performed much better : it became the fastest so far. I finally found out the true reason. My 32bits box use gcc 4.4.1 and my 64bits box gcc 4.4.3 and the last one seems much better at optimizing this particular code (there was very little difference for other proposals).

更新:

正如上面公布的数字所示,这种效果在gcc的后续版本中仍然得到了增强,Rank Order的速度始终是其他任何替代版本的两倍。

用重新排序的交换对网络进行排序

The amazing efficiency of the Rex Kerr proposal with gcc 4.4.3 made me wonder : how could a program with 3 times as much memory usage be faster than branchless sorting networks? My hypothesis was that it had less dependencies of the kind read after write, allowing for better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder swaps to minimize read after write dependencies. More simply put: when you do SWAP(1, 2); SWAP(0, 2); you have to wait for the first swap to be finished before performing the second one because both access to a common memory cell. When you do SWAP(1, 2); SWAP(4, 5);the processor can execute both in parallel. I tried it and it works as expected, the sorting networks is running about 10% faster.

使用简单交换对网络进行排序

One year after the original post Steinar H. Gunderson suggested, that we should not try to outsmart the compiler and keep the swap code simple. It's indeed a good idea as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly code that can still spare some more cycles. The most surprising (it says volumes on programmer's psychology) is that one year ago none of used tried that version of swap. Code I used to test is here. Others suggested other ways to write a C fast swap, but it yields the same performances as the simple one with a decent compiler.

“最佳”代码如下:

static inline void sort6_sorting_network_simple_swap(int * d){
#define min(x, y) (x<y?x:y)
#define max(x, y) (x<y?y:x) 
#define SWAP(x,y) { const int a = min(d[x], d[y]); \
                    const int b = max(d[x], d[y]); \
                    d[x] = a; d[y] = b; }
    SWAP(1, 2);
    SWAP(4, 5);
    SWAP(0, 2);
    SWAP(3, 5);
    SWAP(0, 1);
    SWAP(3, 4);
    SWAP(1, 4);
    SWAP(0, 3);
    SWAP(2, 5);
    SWAP(1, 3);
    SWAP(2, 4);
    SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}

如果我们相信我们的测试集(是的,它很差,它的唯一好处是简短,简单,易于理解我们所测量的内容),那么一个排序的结果代码的平均循环次数低于40个循环(执行6个测试)。这使得每次交换平均为4个周期。我称之为惊人的快。还有其他可能的改进吗?


当前回答

几天前,我无意中从谷歌中发现了这个问题,因为我还需要快速排序一个由6个整数组成的固定长度数组。然而,在我的情况下,我的整数只有8位(而不是32位),我没有严格的要求只使用c。我想我无论如何都会分享我的发现,以防他们可能对某人有帮助……

我在程序集中实现了一个网络排序的变体,它使用SSE尽可能地向量化比较和交换操作。需要六次“传递”才能对数组进行完全排序。我使用了一种新颖的机制,直接将PCMPGTB(向量化比较)的结果转换为PSHUFB(向量化交换)的洗牌参数,只使用PADDB(向量化添加),在某些情况下还使用PAND(位与)指令。

这种方法也有产生真正无分支函数的副作用。没有任何跳跃指令。

这个实现似乎比目前在问题(“用简单交换排序网络12”)中被标记为最快选项的实现快38%左右。在测试期间,我修改了该实现以使用char数组元素,以使比较公平。

我应该指出,这种方法可以应用于任何大小不超过16个元素的数组。我希望在更大的数组中,相对于替代方案的速度优势会越来越大。

代码是用MASM编写的,适用于带有SSSE3的x86_64处理器。该函数使用“new”Windows x64调用约定。在这儿……

PUBLIC simd_sort_6

.DATA

ALIGN 16

pass1_shuffle   OWORD   0F0E0D0C0B0A09080706040503010200h
pass1_add       OWORD   0F0E0D0C0B0A09080706050503020200h
pass2_shuffle   OWORD   0F0E0D0C0B0A09080706030405000102h
pass2_and       OWORD   00000000000000000000FE00FEFE00FEh
pass2_add       OWORD   0F0E0D0C0B0A09080706050405020102h
pass3_shuffle   OWORD   0F0E0D0C0B0A09080706020304050001h
pass3_and       OWORD   00000000000000000000FDFFFFFDFFFFh
pass3_add       OWORD   0F0E0D0C0B0A09080706050404050101h
pass4_shuffle   OWORD   0F0E0D0C0B0A09080706050100020403h
pass4_and       OWORD   0000000000000000000000FDFD00FDFDh
pass4_add       OWORD   0F0E0D0C0B0A09080706050403020403h
pass5_shuffle   OWORD   0F0E0D0C0B0A09080706050201040300h
pass5_and       OWORD 0000000000000000000000FEFEFEFE00h
pass5_add       OWORD   0F0E0D0C0B0A09080706050403040300h
pass6_shuffle   OWORD   0F0E0D0C0B0A09080706050402030100h
pass6_add       OWORD   0F0E0D0C0B0A09080706050403030100h

.CODE

simd_sort_6 PROC FRAME

    .endprolog

    ; pxor xmm4, xmm4
    ; pinsrd xmm4, dword ptr [rcx], 0
    ; pinsrb xmm4, byte ptr [rcx + 4], 4
    ; pinsrb xmm4, byte ptr [rcx + 5], 5
    ; The benchmarked 38% faster mentioned in the text was with the above slower sequence that tied up the shuffle port longer.  Same on extract
    ; avoiding pins/extrb also means we don't need SSE 4.1, but SSSE3 CPUs without SSE4.1 (e.g. Conroe/Merom) have slow pshufb.
    movd    xmm4, dword ptr [rcx]
    pinsrw  xmm4,  word ptr [rcx + 4], 2  ; word 2 = bytes 4 and 5


    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass1_shuffle]
    pcmpgtb xmm5, xmm4
    paddb xmm5, oword ptr [pass1_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass2_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass2_and]
    paddb xmm5, oword ptr [pass2_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass3_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass3_and]
    paddb xmm5, oword ptr [pass3_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass4_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass4_and]
    paddb xmm5, oword ptr [pass4_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass5_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass5_and]
    paddb xmm5, oword ptr [pass5_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass6_shuffle]
    pcmpgtb xmm5, xmm4
    paddb xmm5, oword ptr [pass6_add]
    pshufb xmm4, xmm5

    ;pextrd dword ptr [rcx], xmm4, 0    ; benchmarked with this
    ;pextrb byte ptr [rcx + 4], xmm4, 4 ; slower version
    ;pextrb byte ptr [rcx + 5], xmm4, 5
    movd   dword ptr [rcx], xmm4
    pextrw  word ptr [rcx + 4], xmm4, 2  ; x86 is little-endian, so this is the right order

    ret

simd_sort_6 ENDP

END

您可以将其编译为可执行对象,并将其链接到您的C项目中。有关如何在Visual Studio中执行此操作的说明,您可以阅读这篇文章。你可以使用下面的C原型从你的C代码中调用这个函数:

void simd_sort_6(char *values);

其他回答

几天前,我无意中从谷歌中发现了这个问题,因为我还需要快速排序一个由6个整数组成的固定长度数组。然而,在我的情况下,我的整数只有8位(而不是32位),我没有严格的要求只使用c。我想我无论如何都会分享我的发现,以防他们可能对某人有帮助……

我在程序集中实现了一个网络排序的变体,它使用SSE尽可能地向量化比较和交换操作。需要六次“传递”才能对数组进行完全排序。我使用了一种新颖的机制,直接将PCMPGTB(向量化比较)的结果转换为PSHUFB(向量化交换)的洗牌参数,只使用PADDB(向量化添加),在某些情况下还使用PAND(位与)指令。

这种方法也有产生真正无分支函数的副作用。没有任何跳跃指令。

这个实现似乎比目前在问题(“用简单交换排序网络12”)中被标记为最快选项的实现快38%左右。在测试期间,我修改了该实现以使用char数组元素,以使比较公平。

我应该指出,这种方法可以应用于任何大小不超过16个元素的数组。我希望在更大的数组中,相对于替代方案的速度优势会越来越大。

代码是用MASM编写的,适用于带有SSSE3的x86_64处理器。该函数使用“new”Windows x64调用约定。在这儿……

PUBLIC simd_sort_6

.DATA

ALIGN 16

pass1_shuffle   OWORD   0F0E0D0C0B0A09080706040503010200h
pass1_add       OWORD   0F0E0D0C0B0A09080706050503020200h
pass2_shuffle   OWORD   0F0E0D0C0B0A09080706030405000102h
pass2_and       OWORD   00000000000000000000FE00FEFE00FEh
pass2_add       OWORD   0F0E0D0C0B0A09080706050405020102h
pass3_shuffle   OWORD   0F0E0D0C0B0A09080706020304050001h
pass3_and       OWORD   00000000000000000000FDFFFFFDFFFFh
pass3_add       OWORD   0F0E0D0C0B0A09080706050404050101h
pass4_shuffle   OWORD   0F0E0D0C0B0A09080706050100020403h
pass4_and       OWORD   0000000000000000000000FDFD00FDFDh
pass4_add       OWORD   0F0E0D0C0B0A09080706050403020403h
pass5_shuffle   OWORD   0F0E0D0C0B0A09080706050201040300h
pass5_and       OWORD 0000000000000000000000FEFEFEFE00h
pass5_add       OWORD   0F0E0D0C0B0A09080706050403040300h
pass6_shuffle   OWORD   0F0E0D0C0B0A09080706050402030100h
pass6_add       OWORD   0F0E0D0C0B0A09080706050403030100h

.CODE

simd_sort_6 PROC FRAME

    .endprolog

    ; pxor xmm4, xmm4
    ; pinsrd xmm4, dword ptr [rcx], 0
    ; pinsrb xmm4, byte ptr [rcx + 4], 4
    ; pinsrb xmm4, byte ptr [rcx + 5], 5
    ; The benchmarked 38% faster mentioned in the text was with the above slower sequence that tied up the shuffle port longer.  Same on extract
    ; avoiding pins/extrb also means we don't need SSE 4.1, but SSSE3 CPUs without SSE4.1 (e.g. Conroe/Merom) have slow pshufb.
    movd    xmm4, dword ptr [rcx]
    pinsrw  xmm4,  word ptr [rcx + 4], 2  ; word 2 = bytes 4 and 5


    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass1_shuffle]
    pcmpgtb xmm5, xmm4
    paddb xmm5, oword ptr [pass1_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass2_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass2_and]
    paddb xmm5, oword ptr [pass2_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass3_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass3_and]
    paddb xmm5, oword ptr [pass3_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass4_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass4_and]
    paddb xmm5, oword ptr [pass4_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass5_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass5_and]
    paddb xmm5, oword ptr [pass5_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass6_shuffle]
    pcmpgtb xmm5, xmm4
    paddb xmm5, oword ptr [pass6_add]
    pshufb xmm4, xmm5

    ;pextrd dword ptr [rcx], xmm4, 0    ; benchmarked with this
    ;pextrb byte ptr [rcx + 4], xmm4, 4 ; slower version
    ;pextrb byte ptr [rcx + 5], xmm4, 5
    movd   dword ptr [rcx], xmm4
    pextrw  word ptr [rcx + 4], xmm4, 2  ; x86 is little-endian, so this is the right order

    ret

simd_sort_6 ENDP

END

您可以将其编译为可执行对象,并将其链接到您的C项目中。有关如何在Visual Studio中执行此操作的说明,您可以阅读这篇文章。你可以使用下面的C原型从你的C代码中调用这个函数:

void simd_sort_6(char *values);

如果它只有6个元素,你可以利用并行性,想要最小化条件分支等等。为什么不生成所有的组合并测试顺序?我敢说,在某些架构中,它可以非常快(只要你预先分配了内存)

我发现至少在我的系统上,下面定义的函数sort6_iterator()和sort6_iterator_local()都运行得至少和上面的当前记录保持者一样快,而且经常明显更快:

#define MIN(x, y) (x<y?x:y)
#define MAX(x, y) (x<y?y:x)

template<class IterType> 
inline void sort6_iterator(IterType it) 
{
#define SWAP(x,y) { const auto a = MIN(*(it + x), *(it + y)); \
  const auto b = MAX(*(it + x), *(it + y)); \
  *(it + x) = a; *(it + y) = b; }

  SWAP(1, 2) SWAP(4, 5)
  SWAP(0, 2) SWAP(3, 5)
  SWAP(0, 1) SWAP(3, 4)
  SWAP(1, 4) SWAP(0, 3)
  SWAP(2, 5) SWAP(1, 3)
  SWAP(2, 4)
  SWAP(2, 3)
#undef SWAP
}

我在计时代码中给这个函数传递了std::vector的迭代器。

I suspect (from comments like this and elsewhere) that using iterators gives g++ certain assurances about what can and can't happen to the memory that the iterator refers to, which it otherwise wouldn't have and it is these assurances that allow g++ to better optimize the sorting code (e.g. with pointers, the compiler can't be sure that all pointers are pointing to different memory locations). If I remember correctly, this is also part of the reason why so many STL algorithms, such as std::sort(), generally have such obscenely good performance.

Moreover, sort6_iterator() is sometimes (again, depending on the context in which the function is called) consistently outperformed by the following sorting function, which copies the data into local variables before sorting them.1 Note that since there are only 6 local variables defined, if these local variables are primitives then they are likely never actually stored in RAM and are instead only ever stored in the CPU's registers until the end of the function call, which helps make this sorting function fast. (It also helps that the compiler knows that distinct local variables have distinct locations in memory).

template<class IterType> 
inline void sort6_iterator_local(IterType it) 
{
#define SWAP(x,y) { const auto a = MIN(data##x, data##y); \
  const auto b = MAX(data##x, data##y); \
  data##x = a; data##y = b; }
//DD = Define Data
#define DD1(a)   auto data##a = *(it + a);
#define DD2(a,b) auto data##a = *(it + a), data##b = *(it + b);
//CB = Copy Back
#define CB(a) *(it + a) = data##a;

  DD2(1,2)    SWAP(1, 2)
  DD2(4,5)    SWAP(4, 5)
  DD1(0)      SWAP(0, 2)
  DD1(3)      SWAP(3, 5)
  SWAP(0, 1)  SWAP(3, 4)
  SWAP(1, 4)  SWAP(0, 3)   CB(0)
  SWAP(2, 5)  CB(5)
  SWAP(1, 3)  CB(1)
  SWAP(2, 4)  CB(4)
  SWAP(2, 3)  CB(2)        CB(3)
#undef CB
#undef DD2
#undef DD1
#undef SWAP
}

请注意,按如下方式定义SWAP()有时会导致稍微更好的性能,但大多数情况下会导致稍微更差的性能或性能差异可以忽略不计。

#define SWAP(x,y) { const auto a = MIN(data##x, data##y); \
  data##y = MAX(data##x, data##y); \
  data##x = a; }

如果你只是想要一个排序算法,在基本数据类型上,gcc -O3始终擅长优化,无论排序函数的调用在1中出现什么上下文,然后,根据你传递输入的方式,尝试以下两种算法之一:

template<class T> inline void sort6(T it) {
#define SORT2(x,y) {if(data##x>data##y){auto a=std::move(data##y);data##y=std::move(data##x);data##x=std::move(a);}}
#define DD1(a)   register auto data##a=*(it+a);
#define DD2(a,b) register auto data##a=*(it+a);register auto data##b=*(it+b);
#define CB1(a)   *(it+a)=data##a;
#define CB2(a,b) *(it+a)=data##a;*(it+b)=data##b;
  DD2(1,2) SORT2(1,2)
  DD2(4,5) SORT2(4,5)
  DD1(0)   SORT2(0,2)
  DD1(3)   SORT2(3,5)
  SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5)
  SORT2(1,4) SORT2(0,3) CB1(0)
  SORT2(2,4) CB1(4)
  SORT2(1,3) CB1(1)
  SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}

或者如果你想通过引用传递变量,那么使用这个(下面的函数与上面的前5行不同):

template<class T> inline void sort6(T& e0, T& e1, T& e2, T& e3, T& e4, T& e5) {
#define SORT2(x,y) {if(data##x>data##y)std::swap(data##x,data##y);}
#define DD1(a)   register auto data##a=e##a;
#define DD2(a,b) register auto data##a=e##a;register auto data##b=e##b;
#define CB1(a)   e##a=data##a;
#define CB2(a,b) e##a=data##a;e##b=data##b;
  DD2(1,2) SORT2(1,2)
  DD2(4,5) SORT2(4,5)
  DD1(0)   SORT2(0,2)
  DD1(3)   SORT2(3,5)
  SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5)
  SORT2(1,4) SORT2(0,3) CB1(0)
  SORT2(2,4) CB1(4)
  SORT2(1,3) CB1(1)
  SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}

使用register关键字的原因是,这是少数几次需要在寄存器中使用这些值的情况之一。在没有寄存器的情况下,编译器会在大多数情况下找出这个问题,但有时不会。使用register关键字可以帮助解决这个问题。但是,通常不要使用register关键字,因为它更有可能减慢代码的速度而不是加快代码的速度。

另外,注意模板的使用。这样做是有目的的,因为即使使用内联关键字,gcc对模板函数的优化通常比普通C函数要积极得多(这与gcc需要处理普通C函数的函数指针而不是模板函数有关)。

While timing various sorting functions I noticed that the context (i.e. surrounding code) in which the call to the sorting function was made had a significant impact on performance, which is likely due to the function being inlined and then optimized. For instance, if the program was sufficiently simple then there usually wasn't much of a difference in performance between passing the sorting function a pointer versus passing it an iterator; otherwise using iterators usually resulted in noticeably better performance and never (in my experience so far at least) any noticeably worse performance. I suspect that this may be because g++ can globally optimize sufficiently simple code.

期待着尝试这一点,并从这些例子中学习,但首先要从我的1.5 GHz PPC Powerbook G4 w/ 1 GB DDR RAM中进行一些计时。(我从http://www.mcs.anl.gov/~kazutomo/rdtsc.html借用了一个类似于rdtsc的PPC定时器来计时。)我运行了几次程序,绝对结果各不相同,但始终最快的测试是“插入排序(Daniel Stutzbach)”,“插入排序展开”紧随其后。

下面是最后一组时间:

**Direct call to qsort library function** : 164
**Naive implementation (insertion sort)** : 138
**Insertion Sort (Daniel Stutzbach)**     : 85
**Insertion Sort Unrolled**               : 97
**Sorting Networks (Daniel Stutzbach)**   : 457
**Sorting Networks (Paul R)**             : 179
**Sorting Networks 12 with Fast Swap**    : 238
**Sorting Networks 12 reordered Swap**    : 236
**Rank Order**                            : 116

我认为你的问题有两个方面。

The first is to determine the optimal algorithm. This is done - at least in this case - by looping through every possible ordering (there aren't that many) which allows you to compute exact min, max, average and standard deviation of compares and swaps. Have a runner-up or two handy as well. The second is to optimize the algorithm. A lot can be done to convert textbook code examples to mean and lean real-life algorithms. If you realize that an algorithm can't be optimized to the extent required, try a runner-up.

I wouldn't worry too much about emptying pipelines (assuming current x86): branch prediction has come a long way. What I would worry about is making sure that the code and data fit in one cache line each (maybe two for the code). Once there fetch latencies are refreshingly low which will compensate for any stall. It also means that your inner loop will be maybe ten instructions or so which is right where it should be (there are two different inner loops in my sorting algorithm, they are 10 instructions/22 bytes and 9/22 long respectively). Assuming the code doesn't contain any divs you can be sure it will be blindingly fast.