Write a program to find the 100 largest numbers in an array of 1 billion numbers

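You can keep a priority queue of the 100 largest numbers seen so far and iterate through the billion numbers. Whenever you encounter a number greater than the smallest number in the queue (the head of the queue), remove the head and add the new number to the queue.

A priority queue implemented with a heap has an insert + delete complexity of O(log K). (Here K = 100, the number of elements to find, and N = 1 billion, the total number of elements in the array.)

In the worst case you get billion * log2(100), which is better than billion * log2(billion) for an O(N log N) comparison-based sort.

In general, if you need the largest K numbers from a set of N numbers, the complexity is O(N log K) rather than O(N log N). This can be very significant when K is very small compared to N.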
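The priority-queue approach described above can be sketched in Java as follows. This is a minimal illustration, not a tuned implementation: the class name `TopKHeap` and method `topK` are made up for this example, and it assumes the data fits in an `int[]` (a real billion-element input would more likely be streamed).

```java
import java.util.Arrays;
import java.util.PriorityQueue;

class TopKHeap {
    // Returns the k largest values of data in ascending order (hypothetical helper).
    static int[] topK(int[] data, int k) {
        if (k <= 0) return new int[0];
        // Min-heap: the head is the smallest of the k largest values seen so far.
        PriorityQueue<Integer> heap = new PriorityQueue<>(k);
        for (int x : data) {
            if (heap.size() < k) {
                heap.add(x);            // the first k values always go in
            } else if (x > heap.peek()) {
                heap.poll();            // remove the current minimum, O(log k)
                heap.add(x);            // insert the new candidate, O(log k)
            }                           // otherwise: one comparison, value discarded
        }
        int[] result = new int[heap.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = heap.poll();    // polling a min-heap yields ascending order
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {5, 1, 9, 3, 7, 9, 2};
        System.out.println(Arrays.toString(topK(data, 3))); // prints [7, 9, 9]
    }
}
```

Note that the heap is a min-heap even though we want the largest values: the element we need cheap access to is the weakest current candidate, since that is the one every new number is compared against.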

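The expected time of this priority queue algorithm is quite interesting, since in each iteration an insertion may or may not occur.

The probability that the i-th number is inserted into the queue is the probability that a random variable is larger than at least i-k random variables from the same distribution (the first k numbers are automatically added to the queue). We can use order statistics to calculate this probability.

For example, assume the numbers are selected uniformly at random from the interval [0, 1]. The expected value of the (i-k)-th number (out of i numbers) is (i-k)/i, and the probability that a random variable is larger than this value is 1-[(i-k)/i] = k/i.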

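Thus, the expected number of insertions is:

$$\sum_{i=k+1}^{n} \frac{k}{i} \approx k \ln\frac{n}{k}$$

And the expected running time can be expressed as:

$$k + (n-k) + \frac{\log k}{2} \sum_{i=k+1}^{n} \frac{k}{i}$$

(k time to build the queue from the first k elements, then n-k comparisons, plus the expected number of insertions as described above, each taking log(k)/2 time on average)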

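Note that when N is very large compared to K, this expression is much closer to N than to N log K. This is somewhat intuitive: in the case of this question, even after 10,000 iterations (which is very small compared to a billion), the chance of a number being inserted into the queue is already very small.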
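To put a number on this, the sketch below evaluates the expected insertion count for N = 10^9 and K = 100, assuming (as in the uniform example) that the i-th number is inserted with probability k/i. It sums k/i directly for i = k+1..n and compares the result with the approximation k ln(n/k). The class and method names are invented for this illustration.

```java
class ExpectedInsertions {
    // Direct sum of k/i for i = k+1..n: the expected number of insertions
    // after the first k elements, under the k/i insertion-probability model.
    static double exactSum(long n, long k) {
        double sum = 0.0;
        for (long i = k + 1; i <= n; i++) {
            sum += (double) k / i;
        }
        return sum;
    }

    // Logarithmic approximation of the same sum.
    static double logApprox(long n, long k) {
        return k * Math.log((double) n / k);
    }

    public static void main(String[] args) {
        long n = 1_000_000_000L;
        long k = 100;
        // The direct sum loops about a billion times, so expect it to take a moment.
        System.out.printf("direct sum: %.1f%n", exactSum(n, k));
        System.out.printf("k ln(n/k):  %.1f%n", logApprox(n, k));
    }
}
```

Both values come out at roughly 1,600, so under this model only about 1.6 insertions per million scanned elements are expected, which is why the comparison pass dominates the running time.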

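But we don't know that the array values are uniformly distributed. They might trend toward increasing, in which case most or all numbers will be new candidates for the set of the 100 largest numbers seen so far. The worst case for this algorithm is O(N log K).

Or, if they trend toward decreasing, most of the 100 largest numbers will come very early, and our best-case running time is essentially O(N + K log K), which is just O(N) for K much smaller than N.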

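Footnote 1: O(N) integer sorting / histogramming

Counting sort and radix sort are both O(N), but they often have larger constant factors that make them worse than comparison sorts in practice. In some special cases they are actually quite fast, primarily for narrow integer types.

For example, counting sort does well when the numbers are small. 16-bit numbers only need an array of 2^16 counters. And instead of actually expanding back into a sorted array, you can just scan the histogram you built as part of counting sort.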

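After histogramming an array, you can quickly answer queries for any order statistic (for example the 99 largest numbers, or the 200th to 100th largest numbers). 32-bit numbers would scatter the counts over a much larger array or hash table of counters, potentially needing 16 GiB of memory (4 bytes for each of 2^32 counters). On real CPUs that would probably cause lots of TLB and cache misses, unlike an array of 2^16 elements, where accesses would typically hit in L2 cache.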
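A minimal sketch of the histogram idea for narrow values follows. It assumes every input value lies in 0..65535 (an unsigned 16-bit range passed in as `int`), and the names `HistogramTopK` and `topK` are hypothetical. Instead of expanding the counts back into a fully sorted array, it scans the histogram from the top and stops once k values have been emitted.

```java
import java.util.Arrays;

class HistogramTopK {
    // Returns the k largest values (duplicates included) in descending order.
    // Assumption: every element of data is in the range 0..65535.
    static int[] topK(int[] data, int k) {
        int[] counts = new int[1 << 16];          // one counter per possible value
        for (int v : data) {
            counts[v]++;                          // O(N) histogram pass
        }
        int[] result = new int[Math.max(0, Math.min(k, data.length))];
        int filled = 0;
        // Walk the histogram from the largest value down until k values are found.
        for (int v = counts.length - 1; v >= 0 && filled < result.length; v--) {
            for (int c = counts[v]; c > 0 && filled < result.length; c--) {
                result[filled++] = v;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {3, 65535, 7, 7, 0, 42};
        System.out.println(Arrays.toString(topK(data, 3))); // prints [65535, 42, 7]
    }
}
```

The same scan answers other order-statistic queries: keep a running total of the counts while walking down and start emitting only once the total passes the desired rank.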

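Similarly, radix sort could look at only the top buckets after the first pass. But the constant factors may still be larger than log K, depending on K.

Note that each counter must be large enough not to overflow even if all N integers are duplicates. 1 billion is somewhat below 2^30, so a 30-bit unsigned counter would be sufficient, and a 32-bit signed or unsigned integer is just fine.

If there were many more numbers, you might need 64-bit counters, which take twice the memory to initialize to zero and to access randomly. Or you could use a sentinel value for the few counters that overflow a 16- or 32-bit integer, to indicate that the rest of the count is stored elsewhere (in a small dictionary, such as a hash table mapping to 64-bit counters).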
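The sentinel idea can be sketched as below, here with 16-bit counters so that the overflow path is easy to exercise. This is only one possible arrangement under stated assumptions: the class `SaturatingCounters`, its methods, and the choice of 0xFFFF as the sentinel are all invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

class SaturatingCounters {
    // Bit pattern 0xFFFF means "the real count lives in the overflow map".
    private static final short SENTINEL = (short) 0xFFFF;
    private final short[] small;                              // compact 16-bit counters
    private final Map<Integer, Long> overflow = new HashMap<>(); // rare large counts

    SaturatingCounters(int buckets) {
        small = new short[buckets];
    }

    void increment(int bucket) {
        if (small[bucket] == SENTINEL) {
            overflow.merge(bucket, 1L, Long::sum);            // already spilled
        } else if ((small[bucket] & 0xFFFF) == 0xFFFE) {
            // The next value would collide with the sentinel: spill to the map.
            overflow.put(bucket, 0xFFFFL);
            small[bucket] = SENTINEL;
        } else {
            small[bucket]++;                                  // common fast path
        }
    }

    long count(int bucket) {
        if (small[bucket] == SENTINEL) {
            return overflow.get(bucket);
        }
        return small[bucket] & 0xFFFF;                        // read as unsigned
    }

    public static void main(String[] args) {
        SaturatingCounters c = new SaturatingCounters(8);
        for (int i = 0; i < 70_000; i++) c.increment(3);
        c.increment(1);
        c.increment(1);
        System.out.println(c.count(3)); // prints 70000
        System.out.println(c.count(1)); // prints 2
    }
}
```

The trade-off is one extra branch per increment in exchange for a counter array half (or a quarter) the size, which only pays off if very few buckets ever reach the sentinel.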

2013-10-07 14:45:54

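This is a question asked by Google or other industry giants. Perhaps the following code is the correct answer the interviewer expects. The time cost and space cost depend on the maximum number in the input array. For a 32-bit int array as input, the maximum space cost is 4 * 125M bytes and the time cost is 5 * billion.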

public class TopNumber {
    public static void main(String[] args) {
        final int input[] = {2389,8922,3382,6982,5231,8934
                            ,4322,7922,6892,5224,4829,3829
                            ,6892,6872,4682,6723,8923,3492};
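        // Each int (4 bytes) holds 32 = 2^5 presence bits, one per value,
        // so covering every 32-bit value takes about 4 * 125M bytes:
        //int sort[] = new int[1 << (32 - 5)];
        // Allocate a small array for a local test (covers values 0..31999)
        int sort[] = new int[1000];
        // Set all bits to 0 (Java already zero-fills new arrays)
        for(int index = 0; index < sort.length; index++){
            sort[index] = 0;
        }
        // Set the bit for each number; duplicates collapse into a single bit
        for(int number : input){
            sort[number >>> 5] |= (1 << (number % 32));
        }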
        int topNum = 0;
        outer:
        for(int index = sort.length - 1; index >= 0; index--){
            if(0 != sort[index]){
                for(int bit = 31; bit >= 0; bit--){
                    if(0 != (sort[index] & (1 << bit))){
                        System.out.println((index << 5) + bit);
                        topNum++;
                        if(topNum >= 3){
                            break outer;
                        }
                    }
                }
            }
        }
    }
}

2013-10-13 09:35:03

The simplest solution is to scan the billion-number array and hold the 100 largest values found so far in a small array buffer without any sorting, remembering the smallest value in this buffer. At first I thought this method was proposed by fordprefect, but in a comment he said that he assumed the 100-number data structure was implemented as a heap. Whenever a new number is found that is larger than the minimum in the buffer, the minimum is overwritten by the new value and the buffer is searched for the current minimum again. If the numbers in the billion-number array are randomly distributed, most of the time the value from the large array is compared to the minimum of the small array and discarded. Only for a very small fraction of the numbers must the value be inserted into the small array, so the cost of manipulating the small data structure can be neglected. For a small number of elements it is hard to determine whether using a priority queue is actually faster than my naive approach.
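A rough sketch of this naive buffer approach in Java is shown below. The names `TopKArrayBuffer`, `selectLargest` and `indexOfMin` are made up for this example, the method assumes the input has at least as many elements as the buffer, and the demo uses 10^7 pseudo-random values rather than 10^9 to keep memory use modest. It also counts how often a buffer slot is overwritten, since that count is what the cost of the rescan depends on.

```java
import java.util.Random;

class TopKArrayBuffer {
    // Fills buffer with the buffer.length largest values of data (unsorted) and
    // returns how many times a slot was overwritten after the initial fill.
    // Assumption: data.length >= buffer.length >= 1.
    static long selectLargest(int[] data, int[] buffer) {
        System.arraycopy(data, 0, buffer, 0, buffer.length);
        int minIndex = indexOfMin(buffer);
        long replacements = 0;
        for (int i = buffer.length; i < data.length; i++) {
            if (data[i] > buffer[minIndex]) {      // usually false: value discarded
                buffer[minIndex] = data[i];        // overwrite the current minimum
                minIndex = indexOfMin(buffer);     // linear rescan of the buffer
                replacements++;
            }
        }
        return replacements;
    }

    // Index of the smallest element, found by a plain linear scan.
    static int indexOfMin(int[] a) {
        int best = 0;
        for (int i = 1; i < a.length; i++) {
            if (a[i] < a[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);               // fixed seed for repeatability
        int[] data = new int[10_000_000];
        for (int i = 0; i < data.length; i++) data[i] = rng.nextInt();
        int[] buffer = new int[100];
        long replacements = selectLargest(data, buffer);
        System.out.println("replacements: " + replacements);
    }
}
```

For random input of this size the replacement count should be on the order of a thousand, so the linear rescans add only a small number of comparisons on top of the ten million done in the main loop.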

I want to estimate the number of inserts into the small 100-element buffer when the 10^9-element array is scanned. The program scans the first 1000 elements of the large array and has to insert at most 1000 elements into the buffer. The buffer then contains 100 of the 1000 elements scanned, that is 0.1 of the elements scanned. So we assume that the probability that a value from the large array is larger than the current minimum of the buffer is about 0.1; such an element has to be inserted into the buffer.

Now the program scans the next 10^4 elements from the large array. With our estimate that about 0.1 of the elements are larger than the current minimum, there are 0.1*10^4=1000 elements to insert. Actually the expected number of insertions will be smaller, because the minimum of the buffer increases every time a new element is inserted. After the scan of these 10^4 elements, the numbers in the buffer make up about 0.01 of the elements scanned so far. So when scanning the next 10^5 numbers we assume that not more than 0.01*10^5=1000 will be inserted into the buffer. Continuing this argument, we have inserted about 7000 values after scanning 1000+10^4+10^5+...+10^9 ~ 10^9 elements of the large array. So when scanning an array with 10^9 elements of random size we expect not more than 10^4 (=7000 rounded up) insertions into the buffer.

After each insertion into the buffer the new minimum must be found. If the buffer is a simple array we need 100 comparisons to find the new minimum. If the buffer is another data structure (like a heap) we need at least 1 comparison to find the minimum. To compare the elements of the large array against the minimum we need 10^9 comparisons. So all in all we need about 10^9+100*10^4=1.001 * 10^9 comparisons when using an array as the buffer and at least 1.000 * 10^9 comparisons when using another type of data structure (like a heap).

So using a heap brings a gain of only 0.1% if performance is determined by the number of comparisons. But what is the difference in execution time between inserting an element into a 100-element heap and replacing an element in a 100-element array and finding its new minimum?

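At the theoretical level: how many comparisons are needed to insert into a heap? I know it is O(log(n)), but how large is the constant factor? At the machine level: what is the impact of caching and branch prediction on the execution time of a heap insert compared with a linear search in an array? At the implementation level: what additional costs are hidden in a heap data structure supplied by a library or a compiler?

I think these are some of the questions that have to be answered before one can try to estimate the real difference between the performance of a 100-element heap and a 100-element array. So it would make sense to run an experiment and measure the real performance.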

2013-10-08 14:35:44

Inspired by @ron teller's answer, here is a bare-bones C program to do what you want.

#include <stdlib.h>
#include <stdio.h>

#define TOTAL_NUMBERS 1000000000
#define N_TOP_NUMBERS 100

/* qsort comparator: orders ints ascending */
int
compare_function(const void *first, const void *second)
{
    int a = *((const int *) first);
    int b = *((const int *) second);
    if (a > b){
        return 1;
    }
    if (a < b){
        return -1;
    }
    return 0;
}

int 
main(int argc, char ** argv)
{
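    if(argc != 2){
        printf("please supply a path to a binary file containing 1000000000 "
               "integers of this machine's wordlength and endianness\n");
        exit(1);
    }
    /* "rb": the file is raw binary, not text */
    FILE * f = fopen(argv[1], "rb");
    if(!f){
        perror(argv[1]);
        exit(1);
    }
    /* Kept sorted ascending, so top100[0] is always the smallest retained
       value. Zero-initialised, so only positive inputs can enter the list. */
    int top100[N_TOP_NUMBERS] = {0};
    int sorts = 0;
    for (int i = 0; i < TOTAL_NUMBERS; i++){
        int number;
        size_t ok;
        ok = fread(&number, sizeof(int), 1, f);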
        if(!ok){
            printf("not enough numbers!\n");
            break;
        }
        if(number > top100[0]){
            sorts++;
            top100[0] = number;
            qsort(top100, N_TOP_NUMBERS, sizeof(int), compare_function);
        }

    }
    printf("%d sorts made\n"
    "the top 100 integers in %s are:\n",
    sorts, argv[1] );
    for (int i = 0; i < N_TOP_NUMBERS; i++){
        printf("%d\n", top100[i]);
    }
    fclose(f);
    exit(0);
}

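On my machine (a Core i3 with a fast SSD) it takes 25 seconds and makes 1724 sorts. I generated the binary file with dd if=/dev/urandom count=1000000000 bs=1.

Obviously there are performance issues with reading only 4 bytes at a time from disk, but this is just for the sake of example. On the plus side, very little memory is needed.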

2013-10-09 00:31:36

I would find out who had the time to put a billion numbers into an array and fire him. Must work for the government. At least if you had a linked list you could insert a number into the middle without moving half a billion to make room. Even better, a B-tree allows for a binary search: each comparison eliminates half of your total. A hash algorithm would allow you to populate the data structure like a checkerboard, but it is not so good for sparse data. As it is, your best bet is to have a solution array of 100 integers and keep track of the lowest number in your solution array, so you can replace it when you come across a higher number in the original array. You would have to look at every element in the original array, assuming it is not sorted to begin with.

2013-10-09 15:11:46
