最近我参加了一个面试,面试官要求我“编写一个程序,从一个包含10亿个数字的数组中找出100个最大的数字”。

我只能给出一个蛮力解决方案,即以O(nlogn)时间复杂度对数组进行排序,并取最后100个数字。

Arrays.sort(array);

面试官正在寻找一个更好的时间复杂度,我尝试了几个其他的解决方案,但都没有回答他。有没有更好的时间复杂度解决方案?


当前回答

我知道这可能会被埋没,但这是我对一个基MSD的变化的想法。

伪代码:

//billion is the array of 1 billion numbers
int[] billion = getMyBillionNumbers();
//this assumes these are 32-bit integers and we are using hex digits
int[][] mynums = int[8][16];

for number in billion
    putInTop100Array(number)

function putInTop100Array(number){
    //basically if we got past all the digits successfully
    if(number == null)
        return true;
    msdIdx = getMsdIdx(number);
    msd = getMsd(number);
    //check if the idx above where we are is already full
    if(mynums[msdIdx][msd+1] > 99) {
        return false;
    } else if(putInTop100Array(removeMSD(number)){
        mynums[msdIdx][msd]++;
        //we've found 100 digits here, no need to keep looking below where we are
        if(mynums[msdIdx][msd] > 99){
           for(int i = 0; i < mds; i++){
              //making it 101 just so we can tell the difference
              //between numbers where we actually found 101, and 
              //where we just set it
              mynums[msdIdx][i] = 101;
           }
        }
        return true;
    }
    return false;
}

函数getMsdIdx(int num)将返回最高位(非零)的下标。函数getMsd(int num)将返回最高位。函数removeMSD(int num)将从一个数字中删除最有效的数字并返回该数字(如果删除最有效的数字后什么都没有留下,则返回null)。

完成后,剩下的就是遍历mynums以获取前100位数字。这大概是这样的:

int[] nums = int[100];
int idx = 0;
for(int i = 7; i >= 0; i--){
    int timesAdded = 0;
    for(int j = 16; j >=0 && timesAdded < 100; j--){
        for(int k = mynums[i][j]; k > 0; k--){
            nums[idx] += j;
            timesAdded++;
            idx++;
        }
    }
}

我需要注意的是,尽管上面的图看起来时间复杂度很高,但实际上它只有O(7*100)左右。

快速解释一下这是为了做什么: 从本质上讲,这个系统试图基于数字中数字的索引和数字的值来使用2d数组中的每个数字。它使用这些值作为索引来跟踪数组中插入了多少数值。当达到100时,它会关闭所有“较低的分支”。

这个算法的时间大概是O(十亿*log(16)*7)+O(100)。我可能是错的。此外,这很可能需要调试,因为它有点复杂,我只是把它写在我的头上。

编辑:没有解释的反对票是没有帮助的。如果你认为这个答案不正确,请留下评论。我很确定,StackOverflow甚至告诉你这样做,当你向下投票。

其他回答

The simplest solution is to scan the billion numbers large array and hold the 100 largest values found so far in a small array buffer without any sorting and remember the smallest value of this buffer. First I thought this method was proposed by fordprefect but in a comment he said that he assumed the 100 number data structure being implemented as a heap. Whenever a new number is found that is larger then the minimum in the buffer is overwritten by the new value found and the buffer is searched for the current minimum again. If the numbers in billion number array are randomly distributed most of the time the value from the large array is compared to the minimum of the small array and discarded. Only for a very very small fraction of number the value must be inserted into the small array. So the difference of manipulating the data structure holding the small numbers can be neglected. For a small number of elements it is hard to determine if the usage of a priority queue is actually faster than using my naive approach.

I want to estimate the number of inserts in the small 100 element array buffer when the 10^9 element array is scanned. The program scans the first 1000 elements of this large array and has to insert at most 1000 elements in the buffer. The buffer contains 100 element of the 1000 elements scanned, that is 0.1 of the element scanned. So we assume that the probability that a value from the large array is larger than the current minimum of the buffer is about 0.1 Such an element has to be inserted in the buffer . Now the program scans the next 10^4 elements from the large array. Because the minimum of the buffer will increase every time a new element is inserted. We estimated that the ratio of elements larger than our current minimum is about 0.1 and so there are 0.1*10^4=1000 elements to insert. Actually the expected number of elements that are inserted into the buffer will be smaller. After the scan of this 10^4 elements fraction of the numbers in the buffer will be about 0.01 of the elements scanned so far. So when scanning the next 10^5 numbers we assume that not more than 0.01*10^5=1000 will be inserted in the buffer. Continuing this argumentation we have inserted about 7000 values after scanning 1000+10^4+10^5+...+10^9 ~ 10^9 elements of the large array. So when scanning an array with 10^9 elements of random size we expect not more than 10^4 (=7000 rounded up) insertions in the buffer. After each insertion into the buffer the new minimum must be found. If the buffer is a simple array we need 100 comparison to find the new minimum. If the buffer is another data structure (like a heap) we need at least 1 comparison to find the minimum. To compare the elements of the large array we need 10^9 comparisons. So all in all we need about 10^9+100*10^4=1.001 * 10^9 comparisons when using an array as buffer and at least 1.000 * 10^9 comparisons when using another type of data structure (like a heap). So using a heap brings only a gain of 0.1% if performance is determined by the number of comparison. But what is the difference in execution time between inserting an element in a 100 element heap and replacing an element in an 100 element array and finding its new minimum?

在理论层面:在堆中插入需要多少比较。我知道它是O(log(n))但常数因子有多大呢?我 在机器级别:缓存和分支预测对堆插入和数组中线性搜索的执行时间有什么影响? 在实现级别:库或编译器提供的堆数据结构中隐藏了哪些额外成本?

我认为,在人们试图估计100个元素堆和100个元素数组的性能之间的真正区别之前,这些都是必须回答的一些问题。所以做一个实验并测量真实的表现是有意义的。

你可以遍历这些数字,需要O(n)

只要发现一个大于当前最小值的值,就将新值添加到一个大小为100的循环队列中。

循环队列的最小值就是新的比较值。继续往队列中添加。如果已满,则从队列中提取最小值。

Recently I am adapting a theory that all the problems in the world could be solved with O(1). And even this one. It wasn't clear from the question what is the range of the numbers. If the numbers are it range from 1 to 10, then probably the the top 100 largest numbers will be a group of 10. The chance that the highest number will be picked out of the 1 billion numbers when the highest number is very small in compare to to 1 billion are very big. So I would give this as an answer in that interview.

你可以在O(n)个时间内完成。只需遍历列表,并跟踪在任何给定点上看到的最大的100个数字,以及该组中的最小值。当你发现一个新的数字大于你的10个数字中的最小值,然后替换它并更新你的新的100的最小值(可能每次你都要花100的常数时间来确定,但这并不影响整体分析)。

从十亿个数字中找到前100个最好使用包含100个元素的最小堆。

首先用遇到的前100个数字对最小堆进行质数。Min-heap将前100个数字中最小的存储在根(顶部)。

现在,当你继续计算其他数字时,只将它们与根数(100中最小的数)进行比较。

如果遇到的新数字大于最小堆的根,则将根替换为该数字,否则忽略它。

作为在最小堆中插入新数字的一部分,堆中最小的数字将移到顶部(根)。

一旦我们遍历了所有的数字,我们将得到最小堆中最大的100个数字。