最近我参加了一个面试,面试官要求我“编写一个程序,从一个包含10亿个数字的数组中找出100个最大的数字”。
我只能给出一个蛮力解决方案,即以O(nlogn)时间复杂度对数组进行排序,并取最后100个数字。
Arrays.sort(array);
面试官正在寻找一个更好的时间复杂度,我尝试了几个其他的解决方案,但都没有回答他。有没有更好的时间复杂度解决方案?
最近我参加了一个面试,面试官要求我“编写一个程序,从一个包含10亿个数字的数组中找出100个最大的数字”。
我只能给出一个蛮力解决方案,即以O(nlogn)时间复杂度对数组进行排序,并取最后100个数字。
Arrays.sort(array);
面试官正在寻找一个更好的时间复杂度,我尝试了几个其他的解决方案,但都没有回答他。有没有更好的时间复杂度解决方案?
当前回答
I would find out who had the time to put a billion numbers into an array and fire him. Must work for government. At least if you had a linked list you could insert a number into the middle without moving half a billion to make room. Even better a Btree allows for a binary search. Each comparison eliminates half of your total. A hash algorithm would allow you to populate the data structure like a checkerboard but not so good for sparse data. As it is your best bet is to have a solution array of 100 integers and keep track of the lowest number in your solution array so you can replace it when you come across a higher number in the original array. You would have to look at every element in the original array assuming it is not sorted to begin with.
其他回答
取十亿个数字中的前一百个,然后排序。现在只需遍历十亿,如果源数大于100中最小的数,则按排序顺序插入。你得到的结果更接近于O(n)除以集合的大小。
I would find out who had the time to put a billion numbers into an array and fire him. Must work for government. At least if you had a linked list you could insert a number into the middle without moving half a billion to make room. Even better a Btree allows for a binary search. Each comparison eliminates half of your total. A hash algorithm would allow you to populate the data structure like a checkerboard but not so good for sparse data. As it is your best bet is to have a solution array of 100 integers and keep track of the lowest number in your solution array so you can replace it when you come across a higher number in the original array. You would have to look at every element in the original array assuming it is not sorted to begin with.
两个选择:
(1)堆(priorityQueue)
维护最小堆的大小为100。遍历数组。一旦元素小于堆中的第一个元素,就替换它。
InSERT ELEMENT INTO HEAP: O(log100)
compare the first element: O(1)
There are n elements in the array, so the total would be O(nlog100), which is O(n)
(2)映射-约简模型。
这与hadoop中的单词计数示例非常相似。 映射工作:计算每个元素出现的频率或次数。 减约:获取顶部K元素。
通常,我会给招聘人员两个答案。他们喜欢什么就给什么。当然,映射缩减编码会很费事,因为您必须知道每个确切的参数。练习一下也无妨。 祝你好运。
从十亿个数字中找到前100个最好使用包含100个元素的最小堆。
首先用遇到的前100个数字对最小堆进行质数。Min-heap将前100个数字中最小的存储在根(顶部)。
现在,当你继续计算其他数字时,只将它们与根数(100中最小的数)进行比较。
如果遇到的新数字大于最小堆的根,则将根替换为该数字,否则忽略它。
作为在最小堆中插入新数字的一部分,堆中最小的数字将移到顶部(根)。
一旦我们遍历了所有的数字,我们将得到最小堆中最大的100个数字。
我看到了很多O(N)的讨论,所以我提出了一些不同的想法。
关于这些数字的性质有什么已知的信息吗?如果答案是随机的,那就不要再进一步了,看看其他答案。你不会得到比他们更好的结果。
However! See if whatever list-populating mechanism populated that list in a particular order. Are they in a well-defined pattern where you can know with certainty that the largest magnitude of numbers will be found in a certain region of the list or on a certain interval? There may be a pattern to it. If that is so, for example if they are guaranteed to be in some sort of normal distribution with the characteristic hump in the middle, always have repeating upward trends among defined subsets, have a prolonged spike at some time T in the middle of the data set like perhaps an incidence of insider trading or equipment failure, or maybe just have a "spike" every Nth number as in analysis of forces after a catastrophe, you can reduce the number of records you have to check significantly.
不管怎样,还是有一些值得思考的东西。也许这会帮助你给未来的面试官一个深思熟虑的回答。我知道,如果有人问我这样一个问题来回应这样的问题,我会印象深刻——这将告诉我,他们正在考虑优化。只是要认识到,优化的可能性并不总是存在的。