Write a program to find the 100 largest numbers in an array of 1 billion numbers

I recently attended an interview where I was asked to "write a program to find the 100 largest numbers in an array of 1 billion numbers".

I could only give a brute-force solution, which was to sort the array in O(n log n) time and take the last 100 numbers.

Arrays.sort(array);

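The interviewer was looking for a better time complexity. I tried a couple of other solutions, but none of them satisfied him. Is there a solution with a better time complexity?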

Current answer

Time ~ O(100 * N)
Space ~ O(100 + N)

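Create an empty list with 100 empty slots
For each number in the input list:
    If the number is smaller than the first one, skip it
    Otherwise, replace the first one with this number
    Then move the number along by adjacent swaps until it is smaller than the next one
Return the list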
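The steps above can be sketched in Java roughly as follows. This is only an illustration under some assumptions: the input is an `int[]`, the class and method names (`SortedBufferTopK`, `topK`) are made up for the example, and the buffer is kept in ascending order so that its first slot always holds the smallest retained value.

```java
import java.util.Arrays;

final class SortedBufferTopK {
    /** Returns the k largest values of input in ascending order (hypothetical helper). */
    static int[] topK(int[] input, int k) {
        if (k <= 0) return new int[0];
        int[] buf = new int[k];   // kept sorted ascending; buf[0] is the smallest retained value
        int filled = 0;           // number of slots in use
        for (int x : input) {
            if (filled < k) {
                // An empty slot is left: append, then swap toward the front until ordered.
                int i = filled++;
                buf[i] = x;
                while (i > 0 && buf[i - 1] > buf[i]) {
                    int tmp = buf[i - 1]; buf[i - 1] = buf[i]; buf[i] = tmp;
                    i--;
                }
            } else if (x > buf[0]) {
                // Replace the smallest value, then swap toward the back until
                // the new value is no larger than its right neighbour.
                buf[0] = x;
                int i = 0;
                while (i + 1 < k && buf[i] > buf[i + 1]) {
                    int tmp = buf[i + 1]; buf[i + 1] = buf[i]; buf[i] = tmp;
                    i++;
                }
            }
            // Values not larger than buf[0] are skipped.
        }
        return Arrays.copyOf(buf, filled);
    }
}
```

For example, `SortedBufferTopK.topK(new int[]{5, 1, 9, 3, 7, 9, 2}, 3)` returns `[7, 9, 9]`. Each accepted value costs at most k swaps, which is where the O(100 * N) worst case comes from.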

Note: if log(input-list.size) + c < 100, then the better approach is to sort the input list and take the first 100 items.

2013-10-09 06:19:07

Other answers

The simplest solution is to scan the billion-number array and keep the 100 largest values found so far in a small array buffer, without any sorting, while remembering the smallest value in this buffer. At first I thought this method had been proposed by fordprefect, but in a comment he said that he assumed the 100-number data structure would be implemented as a heap. Whenever a new number is found that is larger than the minimum in the buffer, the minimum is overwritten by the new value and the buffer is searched for its current minimum again. If the numbers in the billion-number array are randomly distributed, most of the time a value from the large array is compared to the minimum of the small array and discarded. Only a very small fraction of the numbers must be inserted into the small array, so the difference in cost between ways of maintaining the small data structure can be neglected. For such a small number of elements, it is hard to tell whether a priority queue is actually faster than my naive approach.
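A minimal Java sketch of this naive buffer, assuming an `int[]` input. The names `NaiveBufferTopK`, `topK` and `indexOfMin` are invented for the illustration. The buffer is unsorted, and only the position of its current minimum is remembered.

```java
import java.util.Arrays;

final class NaiveBufferTopK {
    /** Returns the k largest values of input in no particular order (hypothetical helper). */
    static int[] topK(int[] input, int k) {
        if (k <= 0) return new int[0];
        if (input.length <= k) return input.clone();
        int[] buf = Arrays.copyOf(input, k);   // start with the first k values
        int minIdx = indexOfMin(buf);
        for (int i = k; i < input.length; i++) {
            // The common case is a single comparison followed by a discard.
            if (input[i] > buf[minIdx]) {
                buf[minIdx] = input[i];        // overwrite the current minimum
                minIdx = indexOfMin(buf);      // linear search for the new minimum
            }
        }
        return buf;
    }

    private static int indexOfMin(int[] a) {
        int best = 0;
        for (int i = 1; i < a.length; i++) {
            if (a[i] < a[best]) best = i;
        }
        return best;
    }
}
```

The linear search runs only when a value is actually inserted, so its cost matters only if insertions are frequent.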

I want to estimate the number of inserts into the small 100-element buffer while the 10^9-element array is scanned. The program scans the first 1000 elements of the large array and has to insert at most 1000 elements into the buffer. The buffer then contains 100 of the 1000 elements scanned, that is, 0.1 of the elements scanned. So we assume that the probability that a value from the large array is larger than the current minimum of the buffer is about 0.1. Such an element has to be inserted into the buffer.

Now the program scans the next 10^4 elements of the large array. We estimated that the ratio of elements larger than our current minimum is about 0.1, so there are 0.1*10^4 = 1000 elements to insert. Actually, the expected number of elements inserted will be smaller, because the minimum of the buffer increases every time a new element is inserted. After scanning these 10^4 elements, the numbers in the buffer make up about 0.01 of the elements scanned so far. So when scanning the next 10^5 numbers, we assume that no more than 0.01*10^5 = 1000 will be inserted into the buffer. Continuing this argument, we have inserted about 7000 values after scanning 1000+10^4+10^5+...+10^9 ~ 10^9 elements of the large array. So when scanning an array of 10^9 elements in random order, we expect no more than 10^4 (7000 rounded up) insertions into the buffer.

After each insertion into the buffer, the new minimum must be found. If the buffer is a simple array, we need 100 comparisons to find the new minimum. If the buffer is another data structure (like a heap), we need at least 1 comparison to find the minimum. To compare the elements of the large array against the minimum, we need 10^9 comparisons. So all in all we need about 10^9+100*10^4 = 1.001 * 10^9 comparisons when using an array as the buffer, and at least 1.000 * 10^9 comparisons when using another type of data structure (like a heap).

So using a heap brings a gain of only 0.1% if performance is determined by the number of comparisons. But what is the difference in execution time between inserting an element into a 100-element heap and replacing an element in a 100-element array and finding its new minimum?
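One way to sanity-check this estimate is a small simulation. The sketch below is an illustration, not part of the original answer: it uses a much smaller n than 10^9 so that it runs quickly, assumes uniformly random `double` values, and uses invented names (`InsertCountSimulation`, `countReplacements`, `expectedReplacements`). It also computes the standard harmonic-sum estimate: for distinct values in random order, the i-th value scanned is among the 100 largest seen so far with probability 100/i, which sums to roughly 100 * ln(n / 100) replacements. For n = 10^9 that is about 1,600, comfortably below the 10^4 bound used above.

```java
import java.util.PriorityQueue;
import java.util.Random;

final class InsertCountSimulation {
    /**
     * Scans n pseudo-random values and counts how often a value replaces the
     * minimum of the k-element buffer after the buffer has first been filled.
     */
    static long countReplacements(int n, int k, long seed) {
        Random rng = new Random(seed);
        PriorityQueue<Double> buffer = new PriorityQueue<>(k); // min-heap
        long replacements = 0;
        for (int i = 0; i < n; i++) {
            double x = rng.nextDouble();
            if (buffer.size() < k) {
                buffer.add(x);                 // initial fill, not counted
            } else if (x > buffer.peek()) {
                buffer.poll();
                buffer.add(x);
                replacements++;
            }
        }
        return replacements;
    }

    /** Estimate k * ln(n / k), valid for distinct values in uniformly random order. */
    static double expectedReplacements(long n, int k) {
        return k * Math.log((double) n / k);
    }
}
```

Comparing `countReplacements(1_000_000, 100, seed)` with `expectedReplacements(1_000_000, 100)` for a few seeds shows how closely the simulated count follows the estimate.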

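At the theoretical level: how many comparisons does an insertion into a heap need? I know it is O(log(n)), but how large is the constant factor?
At the machine level: what is the impact of caching and branch prediction on the execution time of a heap insert and of a linear search in an array?
At the implementation level: what additional costs are hidden in a heap data structure supplied by a library or a compiler?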

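I think these are some of the questions that have to be answered before one can try to estimate the real difference between the performance of a 100-element heap and a 100-element array. So it would make sense to run an experiment and measure the real performance.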

2013-10-08 14:35:44

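The simple solution is to use a priority queue: add the first 100 numbers to the queue and keep track of the smallest number in the queue, then iterate through the rest of the billion numbers, and every time we find one that is larger than the smallest number in the priority queue, remove the smallest number, add the new number, and again keep track of the smallest number in the queue.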
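A minimal Java sketch of this priority-queue approach, assuming an `int[]` input. `java.util.PriorityQueue` is a min-heap by default, so `peek()` returns the smallest retained number. The class and method names are invented for the example.

```java
import java.util.PriorityQueue;

final class HeapTopK {
    /** Returns the k largest values of input in ascending order (hypothetical helper). */
    static int[] topK(int[] input, int k) {
        if (k <= 0) return new int[0];
        PriorityQueue<Integer> heap = new PriorityQueue<>(k); // smallest value on top
        for (int x : input) {
            if (heap.size() < k) {
                heap.add(x);                  // fill phase: the first k numbers
            } else if (x > heap.peek()) {
                heap.poll();                  // remove the smallest retained number
                heap.add(x);                  // and keep the new one instead
            }
        }
        int[] out = new int[heap.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = heap.poll();             // polling a min-heap yields ascending order
        }
        return out;
    }
}
```

For example, `HeapTopK.topK(new int[]{5, 1, 9, 3, 7, 9, 2}, 3)` returns `[7, 9, 9]`. Each replacement costs O(log k), so the worst case is O(n log k) rather than O(n log n).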

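If the numbers are in random order, this works beautifully, because as we iterate through a billion random numbers it is very rare for the next number to be among the 100 largest so far. But the numbers might not be random. If the array is already sorted in ascending order, then every single element gets inserted into the priority queue.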

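So we first pick, say, 100,000 random numbers from the array. To avoid random access, which might be slow, we add, say, 400 random groups of 250 consecutive numbers each. With that random selection, we can be quite sure that very few of the remaining numbers make it into the top 100, so the execution time will be very close to that of a simple loop comparing a billion numbers against a single value.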
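The sampling idea could be sketched in Java as below. This is a variation built on assumptions rather than the answer's exact procedure: instead of leaving the sampled numbers in the queue (which would require skipping them in the main pass), the sample is used only to derive a threshold, and the full array is then scanned once. One run is drawn from each of `groups` equal blocks so that runs never overlap. All names and the `groups`, `groupLen` and `seed` parameters are hypothetical.

```java
import java.util.PriorityQueue;
import java.util.Random;

final class SampledTopK {
    /** Returns the k largest values in ascending order, pre-filtering with a sampled threshold. */
    static int[] topK(int[] input, int k, int groups, int groupLen, long seed) {
        if (k <= 0) return new int[0];
        int n = input.length;
        int threshold = Integer.MIN_VALUE;     // no filtering unless the sample gives a bound
        int block = groups > 0 ? n / groups : 0;
        if (groupLen > 0 && block >= groupLen && (long) groups * groupLen >= k) {
            Random rng = new Random(seed);
            PriorityQueue<Integer> sampleTop = new PriorityQueue<>(k);
            for (int g = 0; g < groups; g++) {
                // One run of consecutive elements per block, so sampled positions are distinct.
                int start = g * block + rng.nextInt(block - groupLen + 1);
                for (int i = start; i < start + groupLen; i++) {
                    offer(sampleTop, input[i], k);
                }
            }
            // k-th largest sampled value: at least k array elements are >= this,
            // so anything smaller cannot be needed for the top k.
            threshold = sampleTop.peek();
        }
        PriorityQueue<Integer> heap = new PriorityQueue<>(k);
        for (int x : input) {
            if (x >= threshold) offer(heap, x, k);   // most values fail this one comparison
        }
        int[] out = new int[heap.size()];
        for (int i = 0; i < out.length; i++) out[i] = heap.poll();
        return out;
    }

    /** Keeps the k largest values offered so far in a min-heap. */
    private static void offer(PriorityQueue<Integer> heap, int x, int k) {
        if (heap.size() < k) {
            heap.add(x);
        } else if (x > heap.peek()) {
            heap.poll();
            heap.add(x);
        }
    }
}
```

With the numbers from the text, a call would look like `SampledTopK.topK(array, 100, 400, 250, seed)`. On an ascending array, the threshold lets the main pass reject most elements before they ever reach the heap.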

2016-04-04 18:42:33

I realize this is tagged "algorithm", but I will toss out some other options, since it probably should also be tagged "interview".

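What is the source of the 1 billion numbers? If it is a database, then 'select value from table order by value desc limit 100' would do the job quite nicely, though there may be dialect differences.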

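Is this a one-off, or something that will be repeated? If repeated, how frequently? If it is a one-off and the data are in a file, then 'cat srcfile | sort (options as needed) | head -100' will quickly get you back to the productive work you are paid to do while the computer handles this trivial chore.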

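If it is repeated, you would advise picking any decent approach to get the initial answer and storing or caching the results, so that you can continuously report the top 100.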

Finally, there is this consideration. Are you looking for an entry-level job and interviewing with a geeky manager or future co-worker? If so, then you can toss out all manner of approaches describing the relative technical pros and cons. If you are looking for a more managerial job, then approach it like a manager would, concerned with the development and maintenance costs of the solution, and say "thank you very much" and leave if the interviewer wants to focus on CS trivia. He and you would be unlikely to have much advancement potential there.

Good luck on your next interview.

2013-10-08 22:09:02

Another O(n) algorithm -

The algorithm finds the largest 100 by elimination

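Consider the binary representation of all the billion numbers. Start from the most significant bit (MSB). Whether the MSB is 1 can be determined with a boolean operation against an appropriate number (for example, a bitwise AND with a mask). If more than 100 of the billion numbers have a 1 there, eliminate the other numbers, which have a 0. Then proceed to the next most significant bit among the remaining numbers. Keep a count of the numbers remaining after each elimination, and continue as long as this count is greater than 100.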
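A rough Java sketch of this bit-by-bit elimination follows. It makes several assumptions that the answer does not state: the values are non-negative `int`s (so bit 30 is the highest value bit), the array is copied rather than filtered in place, and the case where fewer values than needed have the current bit set is handled by accepting those values and continuing among the rest. The names `BitwiseTopK` and `topK` are invented.

```java
final class BitwiseTopK {
    /** Returns the k largest values of input in no particular order; assumes non-negative ints. */
    static int[] topK(int[] input, int k) {
        if (k <= 0) return new int[0];
        if (input.length <= k) return input.clone();
        int[] candidates = input.clone();   // working copy; the first `size` entries are still in play
        int size = candidates.length;
        int[] result = new int[k];
        int found = 0;                      // values already known to be in the answer
        for (int bit = 30; bit >= 0 && size > k - found; bit--) {
            int mask = 1 << bit;
            // Partition: move candidates with this bit set to the front.
            int ones = 0;
            for (int i = 0; i < size; i++) {
                if ((candidates[i] & mask) != 0) {
                    int tmp = candidates[ones];
                    candidates[ones] = candidates[i];
                    candidates[i] = tmp;
                    ones++;
                }
            }
            int need = k - found;
            if (ones >= need) {
                size = ones;                // enough 1s: eliminate every candidate with a 0
            } else {
                // Too few 1s: all of them are in the answer; keep searching among the 0s.
                System.arraycopy(candidates, 0, result, found, ones);
                found += ones;
                System.arraycopy(candidates, ones, candidates, 0, size - ones);
                size -= ones;
            }
        }
        // Either exactly the needed count remains, or the survivors are all equal.
        System.arraycopy(candidates, 0, result, found, k - found);
        return result;
    }
}
```

Each pass is a linear scan over the surviving candidates with one AND per element, which is the part that could be run in parallel.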

The main boolean operations can be done in parallel on a GPU

2013-10-09 12:40:14

Recently I have been adopting a theory that all the problems in the world can be solved in O(1), even this one. It wasn't clear from the question what the range of the numbers is. If the numbers range from 1 to 10, then the top 100 largest numbers will probably be a group of 10s. When the highest possible number is very small compared to 1 billion, the chance that it appears among the 1 billion numbers is very high. So I would give this as an answer in that interview.

2013-10-15 19:35:39
