Write a program to find the 100 largest numbers out of an array of 1 billion numbers

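I recently attended an interview where I was asked to "write a program to find the 100 largest numbers out of an array of 1 billion numbers".

I could only give a brute-force solution: sort the array in O(n log n) time and take the last 100 numbers.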

Arrays.sort(array);

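The interviewer was looking for a better time complexity. I tried a couple of other solutions, but none of them satisfied him. Is there a solution with better time complexity?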

Current answer

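My immediate reaction to this would be to use a heap, but there is a way to use QuickSelect without keeping all of the input values on hand at any one time.

Create an array of size 200 and fill it with the first 200 input values. Run QuickSelect and discard the low 100, leaving you with 100 free places. Read in the next 100 input values and run QuickSelect again. Continue until you have run through the entire input in batches of 100.

At the end you have the top 100 values. For N values you have run QuickSelect roughly N/100 times. Each QuickSelect costs about 200 times some constant, so the total cost is 2N times some constant. This looks linear in the size of the input to me, regardless of the parameter size that I am hardwiring to be 100 in this explanation.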
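The batching scheme described above might be sketched roughly as follows. This is only an illustration, not the answerer's code: the class name `BatchedQuickSelect`, the method names, and the choice of a randomized Hoare-style partition are all assumptions, and the input is taken from an in-memory array for simplicity even though the same loop would work on a stream.

```java
import java.util.Arrays;
import java.util.Random;

public class BatchedQuickSelect {
    private static final Random RNG = new Random();

    /** Returns the k largest values of input in no particular order. Assumes k >= 1. */
    public static int[] topK(int[] input, int k) {
        int[] buf = new int[2 * k];   // the "size 200" buffer when k = 100
        int filled = 0;
        for (int value : input) {
            buf[filled++] = value;
            if (filled == buf.length) {
                // Move the k largest values into buf[k..2k-1], then discard the low k.
                select(buf, 0, buf.length - 1, k);
                System.arraycopy(buf, k, buf, 0, k);
                filled = k;
            }
        }
        if (filled > k) {
            // Last, partially filled batch: keep only its k largest values.
            select(buf, 0, filled - 1, filled - k);
            return Arrays.copyOfRange(buf, filled - k, filled);
        }
        return Arrays.copyOf(buf, filled);
    }

    /**
     * QuickSelect: rearranges a[lo..hi] so that a[target] holds the value it
     * would hold if the range were sorted, with no larger value before it and
     * no smaller value after it.
     */
    private static void select(int[] a, int lo, int hi, int target) {
        while (lo < hi) {
            int pivot = a[lo + RNG.nextInt(hi - lo + 1)];
            int i = lo, j = hi;
            while (i <= j) {
                while (a[i] < pivot) i++;
                while (a[j] > pivot) j--;
                if (i <= j) {
                    int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                    i++;
                    j--;
                }
            }
            if (target <= j) hi = j;        // target lies in the left part
            else if (target >= i) lo = i;   // target lies in the right part
            else return;                    // target sits on a pivot-valued element
        }
    }
}
```

Calling `BatchedQuickSelect.topK(array, 100)` never holds more than 200 values at once, which is the point of the approach; the expected linear cost relies on the random pivot behaving well on average.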

2013-10-07 18:50:36

Other answers

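This is the kind of question asked by Google or other industry giants. Maybe the following code is the right answer the interviewer expects. The time cost and space cost depend on the maximum number in the input array. For 32-bit int array input, the maximum space cost is 4 * 125M bytes and the time cost is 5 * billion.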

public class TopNumber {
    public static void main(String[] args) {
        final int input[] = {2389,8922,3382,6982,5231,8934
                            ,4322,7922,6892,5224,4829,3829
                            ,6892,6872,4682,6723,8923,3492};
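        // One int (4 bytes) holds 32 = 2^5 flags, one bit per value.
        // Covering every 32-bit value takes 2^27 ints, about 4 * 125M bytes:
        // int sort[] = new int[1 << (32 - 5)];
        // Allocate a small array for a local test (covers values 0..31999).
        // Java zero-initializes new arrays, so every bit starts at 0.
        int sort[] = new int[1000];
        // Set the bit for each value. Duplicates collapse into one bit, and
        // inputs are assumed to be non-negative.
        for(int number : input){
            sort[number >>> 5] |= (1 << (number % 32));
        }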
        int topNum = 0;
        outer:
        for(int index = sort.length - 1; index >= 0; index--){
            if(0 != sort[index]){
                for(int bit = 31; bit >= 0; bit--){
                    if(0 != (sort[index] & (1 << bit))){
                        System.out.println((index << 5) + bit);
                        topNum++;
                        if(topNum >= 3){
                            break outer;
                        }
                    }
                }
            }
        }
    }
}

2013-10-13 09:35:03

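Two options:

(1) Heap (PriorityQueue)

Maintain a min-heap of size 100. Traverse the array. Whenever an element is larger than the first (smallest) element in the heap, replace that element with it.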

Insert an element into the heap: O(log 100)
Compare with the first element: O(1)
There are n elements in the array, so the total is O(n log 100), which is O(n)

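(2) MapReduce model.

This is very similar to the word count example in Hadoop. Map job: count every element's frequency, i.e. the number of times it appears. Reduce job: get the top K elements.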
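To make the two phases concrete, here is a single-process sketch of the same idea. It is not Hadoop code and uses no Hadoop API; the class name `FrequencyTopK` and the use of a `TreeMap` are assumptions made for illustration only. The counting loop plays the role of the map job and the descending walk plays the role of the reduce job.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FrequencyTopK {
    /** Returns the k largest values of input, largest first, duplicates included. */
    public static List<Integer> topK(int[] input, int k) {
        // "Map" phase: count how often each value appears.
        // The map is ordered from the largest key to the smallest.
        TreeMap<Integer, Integer> counts = new TreeMap<>(Comparator.reverseOrder());
        for (int value : input) {
            counts.merge(value, 1, Integer::sum);
        }
        // "Reduce" phase: walk the keys from the largest down and
        // emit each value as many times as it occurred, up to k values.
        List<Integer> result = new ArrayList<>(k);
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            for (int c = entry.getValue(); c > 0 && result.size() < k; c--) {
                result.add(entry.getKey());
            }
            if (result.size() >= k) {
                break;
            }
        }
        return result;
    }
}
```

On one machine this only pays off when the input has few distinct values, since the map holds one entry per distinct value; in a real cluster the counting would be spread across many mappers.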

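Usually I would give the recruiter both answers and let them pick whichever they like. Of course, coding MapReduce would be laborious, because you have to know every exact parameter. No harm in practicing it. Good luck.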

2013-10-09 00:27:50

I wrote my own code; not sure whether it is what the "interviewer" is looking for

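import java.util.PriorityQueue;

private static final int MAX = 100;

// Returns a min-heap holding the MAX largest values of array.
static PriorityQueue<Integer> top(int[] array) {
    PriorityQueue<Integer> queue = new PriorityQueue<>(MAX);
    for (int value : array) {
        if (queue.size() < MAX) {
            // Heap not full yet: keep every value.
            queue.add(value);
        } else if (queue.peek() < value) {
            // Heap is full and value beats the current minimum: replace it.
            queue.poll();
            queue.add(value);
        }
    }
    return queue;
}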

2015-05-11 21:04:20

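Time ~ O(100 * N)
Space ~ O(100 + N)

Create an empty list of 100 empty slots.
For every number in the input list:
    If the number is smaller than the first one, skip it.
    Otherwise replace the first one with this number,
    then push the number up through adjacent swaps until it is smaller than the next one.
Return the list.

Note: if log(input-list.size) + c < 100, then the optimal way is to sort the input list and then split off the first 100 items.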
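The steps above could look roughly like this in Java. This is a sketch under stated assumptions rather than the answerer's implementation: the class name `InsertionTopK` is made up, empty slots are represented by `Integer.MIN_VALUE`, and the input is assumed to contain at least k numbers.

```java
import java.util.Arrays;

public class InsertionTopK {
    /**
     * Returns the k largest values of input in ascending order.
     * Assumes k >= 1 and input.length >= k; with fewer inputs the unused
     * slots would still hold the Integer.MIN_VALUE placeholder.
     */
    public static int[] topK(int[] input, int k) {
        int[] top = new int[k];
        Arrays.fill(top, Integer.MIN_VALUE);  // placeholder for "empty slot"
        for (int value : input) {
            if (value < top[0]) {
                continue;                     // smaller than the smallest kept value
            }
            top[0] = value;                   // replace the smallest kept value
            // Push the new value up with adjacent swaps until the next slot is not smaller.
            for (int i = 0; i + 1 < k && top[i] > top[i + 1]; i++) {
                int tmp = top[i];
                top[i] = top[i + 1];
                top[i + 1] = tmp;
            }
        }
        return top;
    }
}
```

Each accepted number costs at most k - 1 swaps, which is where the O(100 * N) bound comes from.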

2013-10-09 06:19:07

I know this might get buried, but here is my idea for a variation on a radix MSD.

Pseudocode:

//billion is the array of 1 billion numbers
int[] billion = getMyBillionNumbers();
//this assumes these are 32-bit integers and we are using hex digits
int[][] mynums = new int[8][16];

for number in billion
    putInTop100Array(number)

function putInTop100Array(number){
    //basically if we got past all the digits successfully
    if(number == null)
        return true;
    msdIdx = getMsdIdx(number);
    msd = getMsd(number);
    //check if the idx above where we are is already full
    if(msd < 15 && mynums[msdIdx][msd+1] > 99) {
        return false;
    } else if(putInTop100Array(removeMSD(number))){
        mynums[msdIdx][msd]++;
        //we've found 100 digits here, no need to keep looking below where we are
        if(mynums[msdIdx][msd] > 99){
           for(int i = 0; i < msd; i++){
              //making it 101 just so we can tell the difference
              //between numbers where we actually found 101, and 
              //where we just set it
              mynums[msdIdx][i] = 101;
           }
        }
        return true;
    }
    return false;
}

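The function getMsdIdx(int num) would return the index of the most significant (non-zero) digit. The function getMsd(int num) would return the most significant digit. The function removeMSD(int num) would remove the most significant digit from a number and return the number (or null if nothing was left after removing the most significant digit).

Once this is done, all that is left is traversing mynums to grab the top 100 digits. This would be something like: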

int[] nums = new int[100];
int idx = 0;
for(int i = 7; i >= 0; i--){
    int timesAdded = 0;
    for(int j = 15; j >=0 && timesAdded < 100; j--){
        for(int k = mynums[i][j]; k > 0; k--){
            nums[idx] += j;
            timesAdded++;
            idx++;
        }
    }
}

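I should note that although the above looks like it has high time complexity, it is really only around O(7*100).

A quick explanation of what this is trying to do: essentially, this system tracks every digit in a 2D array, indexed by the digit's position in the number and by the digit's value. It uses these as indexes to keep track of how many numbers with that value have been inserted. When 100 has been reached, it closes off all "lower branches".

The time of this algorithm is something like O(billion*log(16)*7)+O(100). I could be wrong about that. Also, it very likely needs debugging, since it is somewhat complex and I just wrote it off the top of my head.

EDIT: Downvotes without explanation are not helpful. If you think this answer is incorrect, please leave a comment. I'm pretty sure StackOverflow even tells you to do so when you downvote.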

2013-10-08 23:53:16
