我的面试问题是这样的:

给定一个包含40亿个整数的输入文件,提供一种算法来生成一个文件中不包含的整数。假设您有1gb内存。如果你只有10mb的内存,你会怎么做。

我的分析:

文件大小为4×109×4 bytes = 16gb。

我们可以进行外部排序,从而知道整数的范围。

我的问题是,在已排序的大整数集中检测缺失整数的最佳方法是什么?

我的理解(看完所有答案后):

假设我们讨论的是32位整数,有232 = 4*109个不同的整数。

情况1:我们有1gb = 1 * 109 * 8位= 80亿位内存。

解决方案:

如果我们用一位表示一个不同的整数,这就足够了。我们不需要排序。

实现:

int radix = 8;
byte[] bitfield = new byte[0xffffffff/radix];
void F() throws FileNotFoundException{
    Scanner in = new Scanner(new FileReader("a.txt"));
    while(in.hasNextInt()){
        int n = in.nextInt();
        bitfield[n/radix] |= (1 << (n%radix));
    }

    for(int i = 0; i< bitfield.lenght; i++){
        for(int j =0; j<radix; j++){
            if( (bitfield[i] & (1<<j)) == 0) System.out.print(i*radix+j);
        }
    }
}

情形二:10mb内存= 10 * 106 * 8bits = 8000万bits

Solution: For all possible 16-bit prefixes, there are 216 number of integers = 65536, we need 216 * 4 * 8 = 2 million bits. We need build 65536 buckets. For each bucket, we need 4 bytes holding all possibilities because the worst case is all the 4 billion integers belong to the same bucket. Build the counter of each bucket through the first pass through the file. Scan the buckets, find the first one who has less than 65536 hit. Build new buckets whose high 16-bit prefixes are we found in step2 through second pass of the file Scan the buckets built in step3, find the first bucket which doesnt have a hit. The code is very similar to above one.

结论: 我们通过增加文件传递来减少内存。


为那些迟到的人澄清一下:这个问题,正如所问的,并不是说有一个整数没有包含在文件中——至少大多数人不是这么理解的。不过,评论线程中的许多评论都是关于任务的变化。不幸的是,把它引入评论线程的评论后来被它的作者删除了,所以现在看起来它的孤儿回复只是误解了一切。这很让人困惑,抱歉。


当前回答

出于某种原因,当我读到这个问题时,我想到了对角化。假设是任意大的整数。

Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion bit number one bit at a time, and that number will not be the same as any in the file. Proof: it were the same as the nth number, then they would agree on the nth bit, but they don't by construction.

其他回答

好的,这并没有经过充分的思考,因为它假设文件中的整数遵循某种静态分布。显然他们不需要这样做,但即使这样,也应该试试这个:


有≈43亿个32位整数。我们不知道它们在文件中是如何分布的,但最糟糕的情况是具有最高香农熵的情况:均匀分布。在这种情况下,任何一个整数不出现在文件中的概率为

((2³²-1)/2³²)⁰⁰⁰⁰≈.4

The lower the Shannon entropy, the higher this probability gets on the average, but even for this worst case we have a chance of 90% to find a nonoccurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator, store them in a list. Then read int after int and compare it to all of your guesses. When there's a match, remove this list entry. After having been through all of the file, chances are you will have more than one guess left. Use any of them. In the rare (10% even at worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10->99%).

内存消耗:几十个字节,复杂度:O(n),开销:neclectable,因为大部分时间将花费在不可避免的硬盘访问上,而不是比较int类型。 当我们不假设静态分布时,实际最坏的情况是每个整数都出现最大值。曾经,因为那时只有 1 - 4000000000/2³²≈6% 所有的整数都不会出现在文件中。因此,您需要更多的猜测,但这仍然不会消耗大量的内存。

我将回答1gb版本:

这个问题没有足够的信息,所以我将先说明一些假设:

整数为32位,取值范围为-2,147,483,648 ~ 2,147,483,647。

伪代码:

var bitArray = new bit[4294967296];  // 0.5 GB, initialized to all 0s.

foreach (var number in file) {
    bitArray[number + 2147483648] = 1;   // Shift all numbers so they start at 0.
}

for (var i = 0; i < 4294967296; i++) {
    if (bitArray[i] == 0) {
        return i - 2147483648;
    }
}

这是个陷阱问题,除非引用不当。只需要通读文件一次,得到最大整数n,并返回n+1。

当然,您需要一个备份计划,以防n+1导致整数溢出。

正如Ryan所说,基本上,对文件进行排序,然后遍历整数,当一个值被跳过时,你就有了:)

EDIT at downvotes: OP提到文件可以排序,所以这是一个有效的方法。

2128*1018 + 1(即(28)16*1018 + 1)——这难道不是今天的普遍答案吗?这表示一个不能保存在16eb文件中的数字,这是当前任何文件系统中的最大文件大小。