最近我参加了一个面试,面试官要求我“编写一个程序,从一个包含10亿个数字的数组中找出100个最大的数字”。

我只能给出一个蛮力解决方案,即以O(nlogn)时间复杂度对数组进行排序,并取最后100个数字。

Arrays.sort(array);

面试官正在寻找一个更好的时间复杂度,我尝试了几个其他的解决方案,但都没有回答他。有没有更好的时间复杂度解决方案?


当前回答

Recently I am adapting a theory that all the problems in the world could be solved with O(1). And even this one. It wasn't clear from the question what is the range of the numbers. If the numbers are it range from 1 to 10, then probably the the top 100 largest numbers will be a group of 10. The chance that the highest number will be picked out of the 1 billion numbers when the highest number is very small in compare to to 1 billion are very big. So I would give this as an answer in that interview.

其他回答

如果在面试中被问到这个问题,面试官可能想看你解决问题的过程,而不仅仅是你的算法知识。

The description is quite general so maybe you can ask him the range or meaning of these numbers to make the problem clear. Doing this may impress an interviewer. If, for example, these numbers stands for people's age then it's a much easier problem. With a reasonable assumption that nobody alive is older than 200, you can use an integer array of size 200 (maybe 201) to count the number of people with the same age in just one iteration. Here the index means the age. After this it's a piece of cake to find 100 largest numbers. By the way this algorithm is called counting sort.

无论如何,让问题更具体、更清楚对你在面试中是有好处的。

从十亿个数字中找到前100个最好使用包含100个元素的最小堆。

首先用遇到的前100个数字对最小堆进行质数。Min-heap将前100个数字中最小的存储在根(顶部)。

现在,当你继续计算其他数字时,只将它们与根数(100中最小的数)进行比较。

如果遇到的新数字大于最小堆的根,则将根替换为该数字,否则忽略它。

作为在最小堆中插入新数字的一部分,堆中最小的数字将移到顶部(根)。

一旦我们遍历了所有的数字,我们将得到最小堆中最大的100个数字。

您可以使用快速选择算法在(按顺序)索引[十亿-101]处查找数字 然后遍历这些数字找出比这个数字更大的数。

array={...the billion numbers...} 
result[100];

pivot=QuickSelect(array,billion-101);//O(N)

for(i=0;i<billion;i++)//O(N)
   if(array[i]>=pivot)
      result.add(array[i]);

该算法时间为:2 X O(N) = O(N)(平均情况性能)

Thomas Jungblut建议的第二个选择是:

使用堆构建最大堆将花费O(N),然后前100个最大的数字将在堆的顶部,所有你需要的是把它们从堆(100 X O(Log(N))。

该算法时间为:O(N) + 100 X O(Log(N)) = O(N)

我知道这可能会被埋没,但这是我对一个基MSD的变化的想法。

伪代码:

//billion is the array of 1 billion numbers
int[] billion = getMyBillionNumbers();
//this assumes these are 32-bit integers and we are using hex digits
int[][] mynums = int[8][16];

for number in billion
    putInTop100Array(number)

function putInTop100Array(number){
    //basically if we got past all the digits successfully
    if(number == null)
        return true;
    msdIdx = getMsdIdx(number);
    msd = getMsd(number);
    //check if the idx above where we are is already full
    if(mynums[msdIdx][msd+1] > 99) {
        return false;
    } else if(putInTop100Array(removeMSD(number)){
        mynums[msdIdx][msd]++;
        //we've found 100 digits here, no need to keep looking below where we are
        if(mynums[msdIdx][msd] > 99){
           for(int i = 0; i < mds; i++){
              //making it 101 just so we can tell the difference
              //between numbers where we actually found 101, and 
              //where we just set it
              mynums[msdIdx][i] = 101;
           }
        }
        return true;
    }
    return false;
}

函数getMsdIdx(int num)将返回最高位(非零)的下标。函数getMsd(int num)将返回最高位。函数removeMSD(int num)将从一个数字中删除最有效的数字并返回该数字(如果删除最有效的数字后什么都没有留下,则返回null)。

完成后,剩下的就是遍历mynums以获取前100位数字。这大概是这样的:

int[] nums = int[100];
int idx = 0;
for(int i = 7; i >= 0; i--){
    int timesAdded = 0;
    for(int j = 16; j >=0 && timesAdded < 100; j--){
        for(int k = mynums[i][j]; k > 0; k--){
            nums[idx] += j;
            timesAdded++;
            idx++;
        }
    }
}

我需要注意的是,尽管上面的图看起来时间复杂度很高,但实际上它只有O(7*100)左右。

快速解释一下这是为了做什么: 从本质上讲,这个系统试图基于数字中数字的索引和数字的值来使用2d数组中的每个数字。它使用这些值作为索引来跟踪数组中插入了多少数值。当达到100时,它会关闭所有“较低的分支”。

这个算法的时间大概是O(十亿*log(16)*7)+O(100)。我可能是错的。此外,这很可能需要调试,因为它有点复杂,我只是把它写在我的头上。

编辑:没有解释的反对票是没有帮助的。如果你认为这个答案不正确,请留下评论。我很确定,StackOverflow甚至告诉你这样做,当你向下投票。

我看到了很多O(N)的讨论,所以我提出了一些不同的想法。

关于这些数字的性质有什么已知的信息吗?如果答案是随机的,那就不要再进一步了,看看其他答案。你不会得到比他们更好的结果。

However! See if whatever list-populating mechanism populated that list in a particular order. Are they in a well-defined pattern where you can know with certainty that the largest magnitude of numbers will be found in a certain region of the list or on a certain interval? There may be a pattern to it. If that is so, for example if they are guaranteed to be in some sort of normal distribution with the characteristic hump in the middle, always have repeating upward trends among defined subsets, have a prolonged spike at some time T in the middle of the data set like perhaps an incidence of insider trading or equipment failure, or maybe just have a "spike" every Nth number as in analysis of forces after a catastrophe, you can reduce the number of records you have to check significantly.

不管怎样,还是有一些值得思考的东西。也许这会帮助你给未来的面试官一个深思熟虑的回答。我知道,如果有人问我这样一个问题来回应这样的问题,我会印象深刻——这将告诉我,他们正在考虑优化。只是要认识到,优化的可能性并不总是存在的。