最近我参加了一个面试,面试官要求我“编写一个程序,从一个包含10亿个数字的数组中找出100个最大的数字”。

我只能给出一个蛮力解决方案,即以O(nlogn)时间复杂度对数组进行排序,并取最后100个数字。

Arrays.sort(array);

面试官正在寻找一个更好的时间复杂度,我尝试了几个其他的解决方案,但都没有回答他。有没有更好的时间复杂度解决方案?


当前回答

我意识到这被标记为“算法”,但会抛出一些其他选项,因为它可能也应该被标记为“面试”。

10亿个数字的来源是什么?如果它是一个数据库,那么“从表中按值顺序选择值desc limit 100”就可以很好地完成工作-可能有方言差异。

这是一次性的,还是会重复发生?如果重复,频率是多少?如果它是一次性的,数据在一个文件中,那么'cat srcfile | sort(根据需要选择)| head -100'将让你快速完成有偿工作,而计算机处理这些琐碎的琐事。

如果重复,你会建议选择任何合适的方法来获得初始答案并存储/缓存结果,这样你就可以连续地报告前100名。

Finally, there is this consideration. Are you looking for an entry level job and interviewing with a geeky manager or future co-worker? If so, then you can toss out all manner of approaches describing the relative technical pros and cons. If you are looking for a more managerial job, then approach it like a manager would, concerned with the development and maintenance costs of the solution, and say "thank you very much" and leave if that is the interviewer wants to focus on CS trivia. He and you would be unlikely to have much advancement potential there.

祝你下次面试好运。

其他回答

我做了我自己的代码,不确定它是否是“面试官”所寻找的

private static final int MAX=100;
 PriorityQueue<Integer> queue = new PriorityQueue<>(MAX);
        queue.add(array[0]);
        for (int i=1;i<array.length;i++)
        {

            if(queue.peek()<array[i])
            {
                if(queue.size() >=MAX)
                {
                    queue.poll();
                }
                queue.add(array[i]);

            }

        }
 Although in this question we should search for top 100 numbers, I will 
 generalize things and write x. Still, I will treat x as constant value.

n中最大的x元素:

我将调用返回值LIST。它是一个x元素的集合(在我看来应该是链表)

First x elements are taken from pool "as they come" and sorted in LIST (this is done in constant time since x is treated as constant - O( x log(x) ) time) For every element that comes next we check if it is bigger than smallest element in LIST and if is we pop out the smallest and insert current element to LIST. Since that is ordered list every element should find its place in logarithmic time (binary search) and since it is ordered list insertion is not a problem. Every step is also done in constant time ( O(log(x) ) time ).

那么,最坏的情况是什么?

xlog(x)+(n-x)(log(x)+1)=nlog(x)+n- x

最坏情况是O(n)时间。+1是检查数字是否大于LIST中最小的数字。平均情况的预期时间将取决于这n个元素的数学分布。

可能的改进

在最坏的情况下,这个算法可以稍微改进,但恕我直言(我无法证明这一点),这会降低平均行为。渐近行为是一样的。

该算法的改进在于,我们将不检查元素是否大于最小值。对于每个元素,我们将尝试插入它,如果它小于最小值,我们将忽略它。尽管如果我们只考虑我们将面临的最坏的情况,这听起来很荒谬

x log(x) + (n-x)log(x) = nlog(x)

操作。

对于这个用例,我没有看到任何进一步的改进。但是你必须问自己,如果我要对不同的x做多于log(n)次呢?显然,我们会以O(nlog (n))为单位对数组进行排序,并在需要时提取x元素。

可能的改进。

如果文件包含十亿的数字,读取它可能会很长…

为了提高工作效率,你可以:

将文件分成n个部分,创建n个线程,让n个线程在各自的部分中寻找最大的100个数字(使用优先级队列),最后得到所有线程输出的最大的100个数字。 使用像hadoop这样的解决方案,使用集群来完成这样的任务。在这里,您可以进一步分割文件,并更快地输出10亿(或10^12)个数字的文件。

我意识到这被标记为“算法”,但会抛出一些其他选项,因为它可能也应该被标记为“面试”。

10亿个数字的来源是什么?如果它是一个数据库,那么“从表中按值顺序选择值desc limit 100”就可以很好地完成工作-可能有方言差异。

这是一次性的,还是会重复发生?如果重复,频率是多少?如果它是一次性的,数据在一个文件中,那么'cat srcfile | sort(根据需要选择)| head -100'将让你快速完成有偿工作,而计算机处理这些琐碎的琐事。

如果重复,你会建议选择任何合适的方法来获得初始答案并存储/缓存结果,这样你就可以连续地报告前100名。

Finally, there is this consideration. Are you looking for an entry level job and interviewing with a geeky manager or future co-worker? If so, then you can toss out all manner of approaches describing the relative technical pros and cons. If you are looking for a more managerial job, then approach it like a manager would, concerned with the development and maintenance costs of the solution, and say "thank you very much" and leave if that is the interviewer wants to focus on CS trivia. He and you would be unlikely to have much advancement potential there.

祝你下次面试好运。

I would find out who had the time to put a billion numbers into an array and fire him. Must work for government. At least if you had a linked list you could insert a number into the middle without moving half a billion to make room. Even better a Btree allows for a binary search. Each comparison eliminates half of your total. A hash algorithm would allow you to populate the data structure like a checkerboard but not so good for sparse data. As it is your best bet is to have a solution array of 100 integers and keep track of the lowest number in your solution array so you can replace it when you come across a higher number in the original array. You would have to look at every element in the original array assuming it is not sorted to begin with.