在回答另一个Stack Overflow问题时,我偶然发现了一个有趣的子问题。对6个整数的数组进行排序的最快方法是什么?

因为问题层次很低:

我们不能假设库是可用的(而且调用本身也有开销),只有纯C 为了避免清空指令管道(这有非常高的成本),我们可能应该最小化分支、跳转和其他类型的控制流中断(比如隐藏在&&或||序列点后面的那些)。 空间是有限的,最小化寄存器和内存的使用是一个问题,理想情况下,就地排序可能是最好的。

实际上,这个问题是一种Golf,其目标不是最小化源长度,而是最小化执行时间。我称之为“Zening”代码,就像Michael Abrash在《Zen of code optimization》一书及其续集中所使用的那样。

至于为什么它有趣,有几个层面:

示例简单,易于理解和测量,不需要太多的C技能 它显示了对问题选择好的算法的影响,也显示了编译器和底层硬件的影响。

下面是我的参考(简单的,不是优化的)实现和测试集。

#include <stdio.h>

static __inline__ int sort6(int * d){

    char j, i, imin;
    int tmp;
    for (j = 0 ; j < 5 ; j++){
        imin = j;
        for (i = j + 1; i < 6 ; i++){
            if (d[i] < d[imin]){
                imin = i;
            }
        }
        tmp = d[j];
        d[j] = d[imin];
        d[imin] = tmp;
    }
}

static __inline__ unsigned long long rdtsc(void)
{
  unsigned long long int x;
     __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
     return x;
}

int main(int argc, char ** argv){
    int i;
    int d[6][5] = {
        {1, 2, 3, 4, 5, 6},
        {6, 5, 4, 3, 2, 1},
        {100, 2, 300, 4, 500, 6},
        {100, 2, 3, 4, 500, 6},
        {1, 200, 3, 4, 5, 600},
        {1, 1, 2, 1, 2, 1}
    };

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6 ; i++){
        sort6(d[i]);
        /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
         *  d[i][0], d[i][6], d[i][7],
         *  d[i][8], d[i][9], d[i][10]);
        */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %d\n", (unsigned)cycles);
}

生的结果

随着变体的数量越来越多,我将它们都收集到一个测试套件中,可以在这里找到。在Kevin Stock的帮助下,实际使用的测试没有上面展示的那么简单。您可以在自己的环境中编译和执行它。我对不同目标架构/编译器上的行为很感兴趣。(好了,伙计们,把它放在答案里,我将+1一个新结果集的每个贡献者)。

一年前,我把答案给了Daniel Stutzbach(高尔夫),因为他是当时最快的解决方案(排序网络)的来源。

Linux 64位,gcc 4.6.1 64位,Intel Core 2 Duo E8400, -O2

Direct call to qsort library function : 689.38 Naive implementation (insertion sort) : 285.70 Insertion Sort (Daniel Stutzbach) : 142.12 Insertion Sort Unrolled : 125.47 Rank Order : 102.26 Rank Order with registers : 58.03 Sorting Networks (Daniel Stutzbach) : 111.68 Sorting Networks (Paul R) : 66.36 Sorting Networks 12 with Fast Swap : 58.86 Sorting Networks 12 reordered Swap : 53.74 Sorting Networks 12 reordered Simple Swap : 31.54 Reordered Sorting Network w/ fast swap : 31.54 Reordered Sorting Network w/ fast swap V2 : 33.63 Inlined Bubble Sort (Paolo Bonzini) : 48.85 Unrolled Insertion Sort (Paolo Bonzini) : 75.30

Linux 64位,gcc 4.6.1 64位,Intel Core 2 Duo E8400, -O1

Direct call to qsort library function : 705.93 Naive implementation (insertion sort) : 135.60 Insertion Sort (Daniel Stutzbach) : 142.11 Insertion Sort Unrolled : 126.75 Rank Order : 46.42 Rank Order with registers : 43.58 Sorting Networks (Daniel Stutzbach) : 115.57 Sorting Networks (Paul R) : 64.44 Sorting Networks 12 with Fast Swap : 61.98 Sorting Networks 12 reordered Swap : 54.67 Sorting Networks 12 reordered Simple Swap : 31.54 Reordered Sorting Network w/ fast swap : 31.24 Reordered Sorting Network w/ fast swap V2 : 33.07 Inlined Bubble Sort (Paolo Bonzini) : 45.79 Unrolled Insertion Sort (Paolo Bonzini) : 80.15

我包括了-O1和-O2的结果,因为令人惊讶的是,在一些程序中,O2的效率低于O1。我想知道什么具体的优化有这种效果?

对建议解决方案的评论

插入排序(丹尼尔·斯图茨巴赫)

正如预期的那样,最小化分支确实是一个好主意。

排序网络(丹尼尔·斯图茨巴赫)

比插入排序好。我想知道主要的效果是不是避免外部循环。我试着通过展开插入排序来检查,确实我们得到了大致相同的数字(代码在这里)。

排序网络(保罗R)

迄今为止最好的。我用来测试的实际代码在这里。目前还不知道为什么它的速度几乎是其他排序网络实现的两倍。参数传递?快速max ?

排序网络12 SWAP与快速交换

根据Daniel Stutzbach的建议,我将他的12交换排序网络与无分支快速交换相结合(代码在这里)。它确实更快,到目前为止最好的,只有很小的利润率(大约5%),因为可以使用更少的交换。

同样有趣的是,无分支交换似乎比在PPC架构上使用if的简单交换效率低得多(4倍)。

调用库qsort

To give another reference point I also tried as suggested to just call library qsort (code is here). As expected it is much slower : 10 to 30 times slower... as it became obvious with the new test suite, the main problem seems to be the initial load of the library after the first call, and it compares not so poorly with other version. It is just between 3 and 20 times slower on my Linux. On some architecture used for tests by others it seems even to be faster (I'm really surprised by that one, as library qsort use a more complex API).

等级次序

Rex Kerr proposed another completely different method : for each item of the array compute directly its final position. This is efficient because computing rank order do not need branch. The drawback of this method is that it takes three times the amount of memory of the array (one copy of array and variables to store rank orders). The performance results are very surprising (and interesting). On my reference architecture with 32 bits OS and Intel Core2 Quad E8300, cycle count was slightly below 1000 (like sorting networks with branching swap). But when compiled and executed on my 64 bits box (Intel Core2 Duo) it performed much better : it became the fastest so far. I finally found out the true reason. My 32bits box use gcc 4.4.1 and my 64bits box gcc 4.4.3 and the last one seems much better at optimizing this particular code (there was very little difference for other proposals).

更新:

正如上面公布的数字所示,这种效果在gcc的后续版本中仍然得到了增强,Rank Order的速度始终是其他任何替代版本的两倍。

用重新排序的交换对网络进行排序

The amazing efficiency of the Rex Kerr proposal with gcc 4.4.3 made me wonder : how could a program with 3 times as much memory usage be faster than branchless sorting networks? My hypothesis was that it had less dependencies of the kind read after write, allowing for better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder swaps to minimize read after write dependencies. More simply put: when you do SWAP(1, 2); SWAP(0, 2); you have to wait for the first swap to be finished before performing the second one because both access to a common memory cell. When you do SWAP(1, 2); SWAP(4, 5);the processor can execute both in parallel. I tried it and it works as expected, the sorting networks is running about 10% faster.

使用简单交换对网络进行排序

One year after the original post Steinar H. Gunderson suggested, that we should not try to outsmart the compiler and keep the swap code simple. It's indeed a good idea as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly code that can still spare some more cycles. The most surprising (it says volumes on programmer's psychology) is that one year ago none of used tried that version of swap. Code I used to test is here. Others suggested other ways to write a C fast swap, but it yields the same performances as the simple one with a decent compiler.

“最佳”代码如下:

static inline void sort6_sorting_network_simple_swap(int * d){
#define min(x, y) (x<y?x:y)
#define max(x, y) (x<y?y:x) 
#define SWAP(x,y) { const int a = min(d[x], d[y]); \
                    const int b = max(d[x], d[y]); \
                    d[x] = a; d[y] = b; }
    SWAP(1, 2);
    SWAP(4, 5);
    SWAP(0, 2);
    SWAP(3, 5);
    SWAP(0, 1);
    SWAP(3, 4);
    SWAP(1, 4);
    SWAP(0, 3);
    SWAP(2, 5);
    SWAP(1, 3);
    SWAP(2, 4);
    SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}

如果我们相信我们的测试集(是的,它很差,它的唯一好处是简短,简单,易于理解我们所测量的内容),那么一个排序的结果代码的平均循环次数低于40个循环(执行6个测试)。这使得每次交换平均为4个周期。我称之为惊人的快。还有其他可能的改进吗?


当前回答

This question is becoming quite old, but I actually had to solve the same problem these days: fast agorithms to sort small arrays. I thought it would be a good idea to share my knowledge. While I first started by using sorting networks, I finally managed to find other algorithms for which the total number of comparisons performed to sort every permutation of 6 values was smaller than with sorting networks, and smaller than with insertion sort. I didn't count the number of swaps; I would expect it to be roughly equivalent (maybe a bit higher sometimes).

算法sort6使用算法sort4,算法sort4使用算法sort3。下面是一些轻量级c++形式的实现(原始的模板较多,因此可以使用任何随机访问迭代器和任何合适的比较函数)。

对3个值排序

下面的算法是展开插入排序。当必须执行两次交换(6个赋值)时,它使用4个赋值:

void sort3(int* array)
{
    if (array[1] < array[0]) {
        if (array[2] < array[0]) {
            if (array[2] < array[1]) {
                std::swap(array[0], array[2]);
            } else {
                int tmp = array[0];
                array[0] = array[1];
                array[1] = array[2];
                array[2] = tmp;
            }
        } else {
            std::swap(array[0], array[1]);
        }
    } else {
        if (array[2] < array[1]) {
            if (array[2] < array[0]) {
                int tmp = array[2];
                array[2] = array[1];
                array[1] = array[0];
                array[0] = tmp;
            } else {
                std::swap(array[1], array[2]);
            }
        }
    }
}

它看起来有点复杂,因为排序对于数组的每一个可能的排列都有或多或少的一个分支,使用2~3个比较和最多4个赋值来排序三个值。

对4个值排序

这个函数调用sort3,然后对数组的最后一个元素执行展开的插入排序:

void sort4(int* array)
{
    // Sort the first 3 elements
    sort3(array);

    // Insert the 4th element with insertion sort 
    if (array[3] < array[2]) {
        std::swap(array[2], array[3]);
        if (array[2] < array[1]) {
            std::swap(array[1], array[2]);
            if (array[1] < array[0]) {
                std::swap(array[0], array[1]);
            }
        }
    }
}

该算法执行3 ~ 6次比较,最多5次交换。展开插入排序很容易,但我们将使用另一种算法进行最后一种排序…

对6个值排序

这一个使用了我称之为双插入排序的展开版本。这个名字不是很好,但很有描述性,下面是它的工作原理:

对数组中除第一个和最后一个元素外的所有元素进行排序。 如果数组的第一个元素大于最后一个元素,则交换数组的第一个元素和最后一个元素。 从前面插入第一个元素,然后从后面插入最后一个元素。

交换后,第一个元素总是比最后一个小,这意味着,当将它们插入排序序列时,在最坏的情况下,插入这两个元素的比较不会超过N次:例如,如果第一个元素已经插入到第3个位置,那么最后一个元素不能插入到第4个位置以下。

void sort6(int* array)
{
    // Sort everything but first and last elements
    sort4(array+1);

    // Switch first and last elements if needed
    if (array[5] < array[0]) {
        std::swap(array[0], array[5]);
    }

    // Insert first element from the front
    if (array[1] < array[0]) {
        std::swap(array[0], array[1]);
        if (array[2] < array[1]) {
            std::swap(array[1], array[2]);
            if (array[3] < array[2]) {
                std::swap(array[2], array[3]);
                if (array[4] < array[3]) {
                    std::swap(array[3], array[4]);
                }
            }
        }
    }

    // Insert last element from the back
    if (array[5] < array[4]) {
        std::swap(array[4], array[5]);
        if (array[4] < array[3]) {
            std::swap(array[3], array[4]);
            if (array[3] < array[2]) {
                std::swap(array[2], array[3]);
                if (array[2] < array[1]) {
                    std::swap(array[1], array[2]);
                }
            }
        }
    }
}

我对6个值的每一次排列的测试表明,这个算法总是执行6到13个比较。我没有计算掉期的数量,但我认为在最坏的情况下它不会高于11。

我希望这能有所帮助,即使这个问题可能不再代表一个实际的问题:)

编辑:在将它放入提供的基准测试之后,它明显比大多数有趣的替代方案要慢。它的性能往往比展开插入排序好一点,但也仅此而已。基本上,它不是整数的最佳排序,但对于具有昂贵比较操作的类型可能很有趣。

其他回答

也许我来晚了,但至少我的贡献是一种新的方法。

The code really should be inlined even if inlined, there are too many branches the analysing part is basically O(N(N-1)) which seems OK for N=6 the code could be more effective if the cost of swap would be higher (irt the cost of compare) I trust on static functions being inlined. The method is related to rank-sort instead of ranks, the relative ranks (offsets) are used. the sum of the ranks is zero for every cycle in any permutation group. instead of SWAP()ing two elements, the cycles are chased, needing only one temp, and one (register->register) swap (new <- old).


更新:代码改动了一点,有些人用c++编译器编译C代码…

#include <stdio.h>

#if WANT_CHAR
typedef signed char Dif;
#else
typedef signed int Dif;
#endif

static int walksort (int *arr, int cnt);
static void countdifs (int *arr, Dif *dif, int cnt);
static void calcranks(int *arr, Dif *dif);

int wsort6(int *arr);

void do_print_a(char *msg, int *arr, unsigned cnt)
{
fprintf(stderr,"%s:", msg);
for (; cnt--; arr++) {
        fprintf(stderr, " %3d", *arr);
        }
fprintf(stderr,"\n");
}

void do_print_d(char *msg, Dif *arr, unsigned cnt)
{
fprintf(stderr,"%s:", msg);
for (; cnt--; arr++) {
        fprintf(stderr, " %3d", (int) *arr);
        }
fprintf(stderr,"\n");
}

static void inline countdifs (int *arr, Dif *dif, int cnt)
{
int top, bot;

for (top = 0; top < cnt; top++ ) {
        for (bot = 0; bot < top; bot++ ) {
                if (arr[top] < arr[bot]) { dif[top]--; dif[bot]++; }
                }
        }
return ;
}
        /* Copied from RexKerr ... */
static void inline calcranks(int *arr, Dif *dif){

dif[0] =     (arr[0]>arr[1])+(arr[0]>arr[2])+(arr[0]>arr[3])+(arr[0]>arr[4])+(arr[0]>arr[5]);
dif[1] = -1+ (arr[1]>=arr[0])+(arr[1]>arr[2])+(arr[1]>arr[3])+(arr[1]>arr[4])+(arr[1]>arr[5]);
dif[2] = -2+ (arr[2]>=arr[0])+(arr[2]>=arr[1])+(arr[2]>arr[3])+(arr[2]>arr[4])+(arr[2]>arr[5]);
dif[3] = -3+ (arr[3]>=arr[0])+(arr[3]>=arr[1])+(arr[3]>=arr[2])+(arr[3]>arr[4])+(arr[3]>arr[5]);
dif[4] = -4+ (arr[4]>=arr[0])+(arr[4]>=arr[1])+(arr[4]>=arr[2])+(arr[4]>=arr[3])+(arr[4]>arr[5]);
dif[5] = -(dif[0]+dif[1]+dif[2]+dif[3]+dif[4]);
}

static int walksort (int *arr, int cnt)
{
int idx, src,dst, nswap;

Dif difs[cnt];

#if WANT_REXK
calcranks(arr, difs);
#else
for (idx=0; idx < cnt; idx++) difs[idx] =0;
countdifs(arr, difs, cnt);
#endif
calcranks(arr, difs);

#define DUMP_IT 0
#if DUMP_IT
do_print_d("ISteps ", difs, cnt);
#endif

nswap = 0;
for (idx=0; idx < cnt; idx++) {
        int newval;
        int step,cyc;
        if ( !difs[idx] ) continue;
        newval = arr[idx];
        cyc = 0;
        src = idx;
        do      {
                int oldval;
                step = difs[src];
                difs[src] =0;
                dst = src + step;
                cyc += step ;
                if(dst == idx+1)idx=dst;
                oldval = arr[dst];
#if (DUMP_IT&1)
                fprintf(stderr, "[Nswap=%d] Cyc=%d Step=%2d Idx=%d  Old=%2d New=%2d #### Src=%d Dst=%d[%2d]->%2d <-- %d\n##\n"
                        , nswap, cyc, step, idx, oldval, newval
                        , src, dst, difs[dst], arr[dst]
                        , newval  );
                do_print_a("Array ", arr, cnt);
                do_print_d("Steps ", difs, cnt);
#endif

                arr[dst] = newval;
                newval = oldval;
                nswap++;
                src = dst;
                } while( cyc);
        }

return nswap;
}
/*************/
int wsort6(int *arr)
{
return walksort(arr, 6);
}

虽然我真的很喜欢交换宏提供:

#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }

我看到了一个改进(一个好的编译器可能会做到):

#define SWAP(x,y) { int tmp = ((x ^ y) & -(y < x)); y ^= tmp; x ^= tmp; }

我们注意到min和max是如何工作的,并显式地提取公共子表达式。这完全消除了min和max宏。

This question is becoming quite old, but I actually had to solve the same problem these days: fast agorithms to sort small arrays. I thought it would be a good idea to share my knowledge. While I first started by using sorting networks, I finally managed to find other algorithms for which the total number of comparisons performed to sort every permutation of 6 values was smaller than with sorting networks, and smaller than with insertion sort. I didn't count the number of swaps; I would expect it to be roughly equivalent (maybe a bit higher sometimes).

算法sort6使用算法sort4,算法sort4使用算法sort3。下面是一些轻量级c++形式的实现(原始的模板较多,因此可以使用任何随机访问迭代器和任何合适的比较函数)。

对3个值排序

下面的算法是展开插入排序。当必须执行两次交换(6个赋值)时,它使用4个赋值:

void sort3(int* array)
{
    if (array[1] < array[0]) {
        if (array[2] < array[0]) {
            if (array[2] < array[1]) {
                std::swap(array[0], array[2]);
            } else {
                int tmp = array[0];
                array[0] = array[1];
                array[1] = array[2];
                array[2] = tmp;
            }
        } else {
            std::swap(array[0], array[1]);
        }
    } else {
        if (array[2] < array[1]) {
            if (array[2] < array[0]) {
                int tmp = array[2];
                array[2] = array[1];
                array[1] = array[0];
                array[0] = tmp;
            } else {
                std::swap(array[1], array[2]);
            }
        }
    }
}

它看起来有点复杂,因为排序对于数组的每一个可能的排列都有或多或少的一个分支,使用2~3个比较和最多4个赋值来排序三个值。

对4个值排序

这个函数调用sort3,然后对数组的最后一个元素执行展开的插入排序:

void sort4(int* array)
{
    // Sort the first 3 elements
    sort3(array);

    // Insert the 4th element with insertion sort 
    if (array[3] < array[2]) {
        std::swap(array[2], array[3]);
        if (array[2] < array[1]) {
            std::swap(array[1], array[2]);
            if (array[1] < array[0]) {
                std::swap(array[0], array[1]);
            }
        }
    }
}

该算法执行3 ~ 6次比较,最多5次交换。展开插入排序很容易,但我们将使用另一种算法进行最后一种排序…

对6个值排序

这一个使用了我称之为双插入排序的展开版本。这个名字不是很好,但很有描述性,下面是它的工作原理:

对数组中除第一个和最后一个元素外的所有元素进行排序。 如果数组的第一个元素大于最后一个元素,则交换数组的第一个元素和最后一个元素。 从前面插入第一个元素,然后从后面插入最后一个元素。

交换后,第一个元素总是比最后一个小,这意味着,当将它们插入排序序列时,在最坏的情况下,插入这两个元素的比较不会超过N次:例如,如果第一个元素已经插入到第3个位置,那么最后一个元素不能插入到第4个位置以下。

void sort6(int* array)
{
    // Sort everything but first and last elements
    sort4(array+1);

    // Switch first and last elements if needed
    if (array[5] < array[0]) {
        std::swap(array[0], array[5]);
    }

    // Insert first element from the front
    if (array[1] < array[0]) {
        std::swap(array[0], array[1]);
        if (array[2] < array[1]) {
            std::swap(array[1], array[2]);
            if (array[3] < array[2]) {
                std::swap(array[2], array[3]);
                if (array[4] < array[3]) {
                    std::swap(array[3], array[4]);
                }
            }
        }
    }

    // Insert last element from the back
    if (array[5] < array[4]) {
        std::swap(array[4], array[5]);
        if (array[4] < array[3]) {
            std::swap(array[3], array[4]);
            if (array[3] < array[2]) {
                std::swap(array[2], array[3]);
                if (array[2] < array[1]) {
                    std::swap(array[1], array[2]);
                }
            }
        }
    }
}

我对6个值的每一次排列的测试表明,这个算法总是执行6到13个比较。我没有计算掉期的数量,但我认为在最坏的情况下它不会高于11。

我希望这能有所帮助,即使这个问题可能不再代表一个实际的问题:)

编辑:在将它放入提供的基准测试之后,它明显比大多数有趣的替代方案要慢。它的性能往往比展开插入排序好一点,但也仅此而已。基本上,它不是整数的最佳排序,但对于具有昂贵比较操作的类型可能很有趣。

The test code is pretty bad; it overflows the initial array (don't people here read compiler warnings?), the printf is printing out the wrong elements, it uses .byte for rdtsc for no good reason, there's only one run (!), there's nothing checking that the end results are actually correct (so it's very easy to “optimize” into something subtly wrong), the included tests are very rudimentary (no negative numbers?) and there's nothing to stop the compiler from just discarding the entire function as dead code.

话虽如此,改进二进制网络解决方案也很容易;简单地改变min/max/SWAP的东西

#define SWAP(x,y) { int tmp; asm("mov %0, %2 ; cmp %1, %0 ; cmovg %1, %0 ; cmovg %2, %1" : "=r" (d[x]), "=r" (d[y]), "=r" (tmp) : "0" (d[x]), "1" (d[y]) : "cc"); }

对我来说,它的速度快了65% (Debian gcc 4.4.5 with -O2, amd64, Core i7)。

异或交换在交换函数中可能很有用。

void xorSwap (int *x, int *y) {
     if (*x != *y) {
         *x ^= *y;
         *y ^= *x;
         *x ^= *y;
     }
 }

if可能会在代码中导致太多的分歧,但如果你能保证所有int都是唯一的,这可能会很方便。