我有一个500000随机生成的Tuple<long,long,字符串>对象的列表,我正在执行一个简单的“之间”搜索:

var data = new List<Tuple<long,long,string>>(500000);
...
var cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x);

When I generate my random array and run my search for 100 randomly generated values of x, the searches complete in about four seconds. Knowing of the great wonders that sorting does to searching, however, I decided to sort my data - first by Item1, then by Item2, and finally by Item3 - before running my 100 searches. I expected the sorted version to perform a little faster because of branch prediction: my thinking has been that once we get to the point where Item1 == x, all further checks of t.Item1 <= x would predict the branch correctly as "no take", speeding up the tail portion of the search. Much to my surprise, the searches took twice as long on a sorted array!

我试着改变我运行实验的顺序,并为随机数生成器使用不同的种子,但效果是相同的:在未排序数组中搜索的速度几乎是在相同数组中搜索的两倍,但已排序!

有人能很好地解释这个奇怪的效应吗?下面是我测试的源代码;我使用的是。net 4.0。


private const int TotalCount = 500000;
private const int TotalQueries = 100;
private static long NextLong(Random r) {
    var data = new byte[8];
    r.NextBytes(data);
    return BitConverter.ToInt64(data, 0);
}
private class TupleComparer : IComparer<Tuple<long,long,string>> {
    public int Compare(Tuple<long,long,string> x, Tuple<long,long,string> y) {
        var res = x.Item1.CompareTo(y.Item1);
        if (res != 0) return res;
        res = x.Item2.CompareTo(y.Item2);
        return (res != 0) ? res : String.CompareOrdinal(x.Item3, y.Item3);
    }
}
static void Test(bool doSort) {
    var data = new List<Tuple<long,long,string>>(TotalCount);
    var random = new Random(1000000007);
    var sw = new Stopwatch();
    sw.Start();
    for (var i = 0 ; i != TotalCount ; i++) {
        var a = NextLong(random);
        var b = NextLong(random);
        if (a > b) {
            var tmp = a;
            a = b;
            b = tmp;
        }
        var s = string.Format("{0}-{1}", a, b);
        data.Add(Tuple.Create(a, b, s));
    }
    sw.Stop();
    if (doSort) {
        data.Sort(new TupleComparer());
    }
    Console.WriteLine("Populated in {0}", sw.Elapsed);
    sw.Reset();
    var total = 0L;
    sw.Start();
    for (var i = 0 ; i != TotalQueries ; i++) {
        var x = NextLong(random);
        var cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x);
        total += cnt;
    }
    sw.Stop();
    Console.WriteLine("Found {0} matches in {1} ({2})", total, sw.Elapsed, doSort ? "Sorted" : "Unsorted");
}
static void Main() {
    Test(false);
    Test(true);
    Test(false);
    Test(true);
}

Populated in 00:00:01.3176257
Found 15614281 matches in 00:00:04.2463478 (Unsorted)
Populated in 00:00:01.3345087
Found 15614281 matches in 00:00:08.5393730 (Sorted)
Populated in 00:00:01.3665681
Found 15614281 matches in 00:00:04.1796578 (Unsorted)
Populated in 00:00:01.3326378
Found 15614281 matches in 00:00:08.6027886 (Sorted)

当前回答

当使用无序列表时,所有元组都是按内存顺序访问的。它们被连续地分配到RAM中。cpu喜欢按顺序访问内存,因为它们可以推测地请求下一条缓存线,以便在需要时始终存在。

当你对列表进行排序时,你把它放入随机顺序,因为你的排序键是随机生成的。这意味着对元组成员的内存访问是不可预测的。CPU不能预取内存,几乎每次对元组的访问都是缓存丢失。

这是一个很好的例子,说明了GC内存管理的一个特殊优势:一起分配和一起使用的数据结构执行得非常好。他们有很大的参考地点。

在这种情况下,缓存失败的损失超过了保存的分支预测损失。

尝试切换到结构元组。这将恢复性能,因为在运行时不需要进行指针解引用来访问元组成员。

Chris Sinclair notes in the comments that "for TotalCount around 10,000 or less, the sorted version does perform faster". This is because a small list fits entirely into the CPU cache. The memory accesses might be unpredictable but the target is always in cache. I believe there is still a small penalty because even a load from cache takes some cycles. But that seems not to be a problem because the CPU can juggle multiple outstanding loads, thereby increasing throughput. Whenever the CPU hits a wait for memory it will still speed ahead in the instruction stream to queue as many memory operations as it can. This technique is used to hide latency.

这种行为表明,在现代cpu上预测性能是多么困难。事实上,当从顺序内存访问到随机内存访问时,我们只慢了2倍,这告诉我隐藏内存延迟的隐藏过程有多少。内存访问可以使CPU停止50-200个周期。考虑到第一点,当引入随机内存访问时,程序会变慢10倍。

其他回答

LINQ不知道你的列表是否已排序。

由于带有谓词参数的Count是所有IEnumerables的扩展方法,我认为它甚至不知道它是否在高效随机访问的集合上运行。所以,它只是检查每个元素,Usr解释了为什么性能变低了。

为了利用排序数组的性能优势(例如二进制搜索),您必须多编写一些代码。

当使用无序列表时,所有元组都是按内存顺序访问的。它们被连续地分配到RAM中。cpu喜欢按顺序访问内存,因为它们可以推测地请求下一条缓存线,以便在需要时始终存在。

当你对列表进行排序时,你把它放入随机顺序,因为你的排序键是随机生成的。这意味着对元组成员的内存访问是不可预测的。CPU不能预取内存,几乎每次对元组的访问都是缓存丢失。

这是一个很好的例子,说明了GC内存管理的一个特殊优势:一起分配和一起使用的数据结构执行得非常好。他们有很大的参考地点。

在这种情况下,缓存失败的损失超过了保存的分支预测损失。

尝试切换到结构元组。这将恢复性能,因为在运行时不需要进行指针解引用来访问元组成员。

Chris Sinclair notes in the comments that "for TotalCount around 10,000 or less, the sorted version does perform faster". This is because a small list fits entirely into the CPU cache. The memory accesses might be unpredictable but the target is always in cache. I believe there is still a small penalty because even a load from cache takes some cycles. But that seems not to be a problem because the CPU can juggle multiple outstanding loads, thereby increasing throughput. Whenever the CPU hits a wait for memory it will still speed ahead in the instruction stream to queue as many memory operations as it can. This technique is used to hide latency.

这种行为表明,在现代cpu上预测性能是多么困难。事实上,当从顺序内存访问到随机内存访问时,我们只慢了2倍,这告诉我隐藏内存延迟的隐藏过程有多少。内存访问可以使CPU停止50-200个周期。考虑到第一点,当引入随机内存访问时,程序会变慢10倍。