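I recently had an interview where I was asked to "write a program to find the 100 largest numbers out of an array of 1 billion numbers."
I was only able to give a brute-force solution: sort the array in O(n log n) time and take the last 100 numbers.
Arrays.sort(array);
The interviewer was looking for a better time complexity. I tried a couple of other solutions but could not give him an answer. Is there a solution with better time complexity?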
Current answer
This code finds the N largest (distinct) numbers in an unsorted array.
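#include <climits>
#include <cstddef>
#include <iostream>
#include <vector>
using namespace std;
#define Array_Size 5 // number of largest values to find
#define BILLION 1000000000 // 10^9 elements in the input array
void findLargest(int max[], const vector<int>& array);
int checkDup(int temp, const int max[]);
int main() {
    // A billion ints (about 4 GB) will not fit on the stack, so allocate on the heap.
    vector<int> array(BILLION); // contains data (zero-filled here; load the real input)
    int max[Array_Size];
    for(int j=0; j< Array_Size; j++)
        max[j] = INT_MIN; // sentinel meaning "slot still empty"
    findLargest(max,array);
    cout<< "The "<< Array_Size<< " largest numbers in the array are: \n";
    for(int i=0; i< Array_Size; i++)
        cout<< max[i] << endl;
    return 0;
}
// After pass k, max[0..k] hold the k+1 largest distinct values in descending order.
void findLargest(int max[], const vector<int>& array)
{
    for(int k=0; k< Array_Size; k++)
    {
        for(size_t i=0; i< array.size(); i++)
        {
            int temp = array[i];
            for(int j=0; j< Array_Size ; j++)
            {
                // Re-check for every slot: temp may just have been stored in an earlier one.
                int res = checkDup(temp,max);
                if(res == 0 && max[j] < temp)
                    max[j] = temp;
            }
        }
    }
}
// Returns -1 if temp is already stored in max, 0 otherwise.
int checkDup(int temp, const int max[])
{
    for(int i=0; i<Array_Size; i++)
    {
        if(max[i] == temp)
            return -1;
    }
    return 0;
}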
This may not be an efficient approach, but it gets the job done.
Hope this helps.
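Other answers
This is the kind of question asked by Google or other industry giants. Maybe the following code is the answer the interviewer expects. The time and space costs depend on the largest number in the input array. For a 32-bit int array, the maximum space cost is 4 * 125M bytes and the time cost is 5 * one billion. Because each value is recorded as a single bit, duplicates collapse into one entry, so the code reports the largest distinct values.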
public class TopNumber {
    public static void main(String[] args) {
        final int input[] = {2389,8922,3382,6982,5231,8934
                            ,4322,7922,6892,5224,4829,3829
                            ,6892,6872,4682,6723,8923,3492};
        // One int (4 bytes) holds 32 = 2^5 values,
        // so the full bitmap needs about 4 * 125M bytes:
        // int sort[] = new int[1 << (32 - 5)];
        // Allocate a small array for a local test
        int sort[] = new int[1000];
        // Set all bits to 0
        for(int index = 0; index < sort.length; index++){
            sort[index] = 0;
        }
        // Mark each input value by setting its bit
        for(int number : input){
            sort[number >>> 5] |= (1 << (number % 32));
        }
        // Walk the bitmap from the highest bit down and print the top 3 values
        int topNum = 0;
        outer:
        for(int index = sort.length - 1; index >= 0; index--){
            if(0 != sort[index]){
                for(int bit = 31; bit >= 0; bit--){
                    if(0 != (sort[index] & (1 << bit))){
                        System.out.println((index << 5) + bit);
                        topNum++;
                        if(topNum >= 3){
                            break outer;
                        }
                    }
                }
            }
        }
    }
}
The simplest solution is to scan the billion-number array while holding the 100 largest values found so far in a small, unsorted array buffer, and to remember the smallest value in that buffer. At first I thought this was the method proposed by fordprefect, but in a comment he said he had assumed the 100-number data structure would be implemented as a heap. Whenever a new number is larger than the buffer's minimum, the minimum is overwritten by the new value and the buffer is searched for its new minimum. If the numbers in the billion-number array are randomly distributed, most of the time a value from the large array is compared to the minimum of the small array and discarded. Only a very small fraction of the numbers has to be inserted into the small array, so the difference in cost between data structures for holding the 100 numbers can be neglected. For such a small number of elements it is hard to tell whether a priority queue is actually faster than my naive approach.
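A minimal Java sketch of the naive buffer described above may make the idea concrete. The class and method names (`NaiveTopK`, `topK`, `indexOfMin`) are made up for illustration, and the demo uses one million random ints rather than a billion so that it runs quickly.

```java
import java.util.Arrays;
import java.util.Random;

class NaiveTopK {
    /** Returns the k largest values of data in no particular order (all of data if it has fewer than k). */
    static int[] topK(int[] data, int k) {
        if (k <= 0) return new int[0];
        int[] buffer = Arrays.copyOf(data, Math.min(k, data.length)); // seed with the first values
        if (data.length <= k) return buffer;
        int minIndex = indexOfMin(buffer);
        for (int i = k; i < data.length; i++) {
            if (data[i] > buffer[minIndex]) {   // most values fail this single comparison
                buffer[minIndex] = data[i];     // overwrite the current minimum
                minIndex = indexOfMin(buffer);  // linear rescan of the small buffer
            }
        }
        return buffer;
    }

    /** Index of the smallest element of a non-empty array. */
    static int indexOfMin(int[] a) {
        int min = 0;
        for (int i = 1; i < a.length; i++) {
            if (a[i] < a[min]) min = i;
        }
        return min;
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(1_000_000).toArray(); // hypothetical test data
        int[] top = topK(data, 100);
        Arrays.sort(top);
        System.out.println("smallest of the top 100: " + top[0]);
        System.out.println("largest of the top 100:  " + top[top.length - 1]);
    }
}
```

The buffer stays unsorted on purpose: the only bookkeeping is the index of its minimum, which is recomputed only when a value is actually written.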
I want to estimate the number of inserts into the small 100-element buffer while the 10^9-element array is scanned. The program scans the first 1000 elements of the large array and has to insert at most 1000 elements into the buffer. The buffer then contains 100 of the 1000 elements scanned, that is, 0.1 of the elements scanned. So we assume that the probability that a value from the large array is larger than the current minimum of the buffer is about 0.1; such an element has to be inserted into the buffer. Now the program scans the next 10^4 elements of the large array. We estimated that the ratio of elements larger than the current minimum is about 0.1, so there are 0.1*10^4=1000 elements to insert. Because the minimum of the buffer increases every time a new element is inserted, the expected number of insertions is actually smaller. After the scan of these 10^4 elements, the numbers in the buffer make up about 0.01 of the elements scanned so far. So when scanning the next 10^5 numbers we assume that no more than 0.01*10^5=1000 will be inserted into the buffer. Continuing this argument, we have inserted about 7000 values after scanning 1000+10^4+10^5+...+10^9 ~ 10^9 elements of the large array. So when scanning an array of 10^9 randomly distributed elements, we expect no more than 10^4 (7000 rounded up) insertions into the buffer.
After each insertion into the buffer the new minimum must be found. If the buffer is a simple array, we need 100 comparisons to find the new minimum. If the buffer is another data structure (like a heap), we need at least 1 comparison to find the minimum. To compare the elements of the large array with the buffer minimum we need 10^9 comparisons. So all in all we need about 10^9+100*10^4=1.001 * 10^9 comparisons when using an array as the buffer, and at least 1.000 * 10^9 comparisons when using another type of data structure (like a heap).
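The bound of 10^4 insertions is deliberately generous. As a cross-check, here is a small sketch under an extra assumption that the answer does not state: the values are distinct and arrive in uniformly random order. In that case the i-th element (for i > k) is among the k largest seen so far with probability k/i, so the expected number of buffer writes is k + k/(k+1) + ... + k/n. The class and method names are hypothetical.

```java
class InsertionEstimate {
    /** Expected number of buffer writes for n distinct values in random order and a buffer of size k. */
    static double expectedInsertions(long n, int k) {
        if (n <= k) return n;               // every value goes straight into the buffer
        double expected = k;                // the first k values always fill the buffer
        for (long i = k + 1L; i <= n; i++) {
            expected += (double) k / i;     // element i beats the current minimum with probability k/i
        }
        return expected;
    }

    public static void main(String[] args) {
        System.out.printf("n = 10^6: %.0f%n", expectedInsertions(1_000_000L, 100));
        System.out.printf("n = 10^9: %.0f%n", expectedInsertions(1_000_000_000L, 100));
    }
}
```

For n = 10^9 and k = 100 the sum comes out at roughly 1,700, which is consistent with the remark that the real number of insertions is smaller than the 7000 estimated above.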
So using a heap brings a gain of only 0.1% if performance is determined by the number of comparisons. But what is the difference in execution time between inserting an element into a 100-element heap and replacing an element in a 100-element array and finding its new minimum?
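At the theoretical level: how many comparisons are needed to insert into a heap? I know it is O(log(n)), but how large is the constant factor?
At the machine level: what is the impact of caching and branch prediction on the execution time of a heap insert and of a linear search in an array?
At the implementation level: what additional costs are hidden in a heap data structure supplied by a library or a compiler?
I think these are some of the questions that have to be answered before one can try to estimate the real difference in performance between a 100-element heap and a 100-element array. So it would make sense to run an experiment and measure the real performance.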
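Two options:
(1) Heap (PriorityQueue)
Maintain a min-heap of size 100 and traverse the array. Whenever an element is larger than the first element of the heap (the smallest of the current top 100), replace that first element with it.
Insert element into heap: O(log 100)
Compare with the first element: O(1)
There are n elements in the array, so the total is O(n log 100), which is O(n) because 100 is a constant.
(2) Map-reduce model.
This is very similar to the word count example in Hadoop. Map job: count the frequency of each element. Reduce: get the top K elements.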
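Option (1) can be sketched in Java with `java.util.PriorityQueue`, which is a min-heap by default. This is only an illustration: the class name `HeapTopK` and the demo data are assumptions, and a real run over a billion values would stream the input rather than hold it in one array.

```java
import java.util.PriorityQueue;
import java.util.Random;

class HeapTopK {
    /** Returns a min-heap containing the k largest values of data. */
    static PriorityQueue<Integer> topK(int[] data, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(Math.max(k, 1));
        if (k <= 0) return heap;
        for (int value : data) {
            if (heap.size() < k) {
                heap.offer(value);              // still filling the heap
            } else if (value > heap.peek()) {   // O(1) comparison with the smallest of the top k
                heap.poll();                    // drop that smallest value
                heap.offer(value);              // O(log k) insert
            }
        }
        return heap;
    }

    public static void main(String[] args) {
        int[] data = new Random(1).ints(1_000_000).toArray(); // hypothetical test data
        PriorityQueue<Integer> top = topK(data, 100);
        System.out.println("heap size: " + top.size());
        System.out.println("smallest of the top 100: " + top.peek());
    }
}
```

The heap's root is always the smallest of the values kept, so a new element only has to beat the root to earn a place.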
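Usually I would give the recruiter both answers and let them pick whichever they like. Of course, coding the map-reduce version would be laborious, because you have to know every exact parameter. It does no harm to practice it. Good luck.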
Recently I have been adopting a theory that all the problems in the world can be solved in O(1), even this one. The question does not make clear what the range of the numbers is. If the numbers range from 1 to 10, then the top 100 largest numbers will probably all be 10s. When the range of values is very small compared to 1 billion, the chance that the highest possible value shows up among the 1 billion numbers is very high. So I would give this as an answer in that interview.
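Taken more literally, a small known range suggests a counting approach, which is a different technique from the guess above: it still reads every element, so it costs O(n + range) time rather than O(1), but it needs only one counter per possible value. The sketch below assumes the caller knows inclusive bounds `lo` and `hi`; those parameters and the class name `CountingTopK` are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

class CountingTopK {
    /** Returns the k largest values of data (repeats included), assuming every value lies in [lo, hi]. */
    static int[] topK(int[] data, int lo, int hi, int k) {
        long[] counts = new long[hi - lo + 1];
        for (int value : data) {
            counts[value - lo]++;               // one counter per possible value
        }
        int[] result = new int[Math.max(0, Math.min(k, data.length))];
        int filled = 0;
        // Walk the counters from the highest value down until k values are collected.
        for (int v = hi; v >= lo && filled < result.length; v--) {
            for (long c = counts[v - lo]; c > 0 && filled < result.length; c--) {
                result[filled++] = v;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = new Random(7).ints(1_000_000, 1, 11).toArray(); // hypothetical values in 1..10
        System.out.println(Arrays.toString(topK(data, 1, 10, 100)));
    }
}
```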
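I realize this is tagged with "algorithm", but I will toss out some other options, since it probably should also be tagged "interview".
What is the source of the 1 billion numbers? If it is a database, then 'select value from table order by value desc limit 100' would do the job quite nicely, though there might be dialect differences.
Is this a one-off, or something that will be repeated? If repeated, how frequently? If it is a one-off and the data are in a file, then 'cat srcfile | sort (options as needed) | head -100' will have you quickly doing the productive work you are paid for while the computer handles this trivial chore.
If it is repeated, you would advise picking any decent approach to get the initial answer and storing or caching the results, so that you can continuously report the top 100.
Finally, there is this consideration. Are you looking for an entry-level job and interviewing with a geeky manager or future co-worker? If so, you can toss out all manner of approaches and describe their relative technical pros and cons. If you are looking for a more managerial job, then approach it like a manager would, concerned with the development and maintenance costs of the solution, and say "thank you very much" and leave if CS trivia is what the interviewer wants to focus on. He and you would be unlikely to have much advancement potential there.
Good luck on your next interview.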