我的面试问题是这样的:

给定一个包含40亿个整数的输入文件,提供一种算法来生成一个文件中不包含的整数。假设您有1gb内存。如果你只有10mb的内存,你会怎么做。

我的分析:

文件大小为4×109×4 bytes = 16gb。

我们可以进行外部排序,从而知道整数的范围。

我的问题是,在已排序的大整数集中检测缺失整数的最佳方法是什么?

我的理解(看完所有答案后):

假设我们讨论的是32位整数,有232 = 4*109个不同的整数。

情况1:我们有1gb = 1 * 109 * 8位= 80亿位内存。

解决方案:

如果我们用一位表示一个不同的整数,这就足够了。我们不需要排序。

实现:

int radix = 8;
byte[] bitfield = new byte[0xffffffff/radix];
void F() throws FileNotFoundException{
    Scanner in = new Scanner(new FileReader("a.txt"));
    while(in.hasNextInt()){
        int n = in.nextInt();
        bitfield[n/radix] |= (1 << (n%radix));
    }

    for(int i = 0; i< bitfield.lenght; i++){
        for(int j =0; j<radix; j++){
            if( (bitfield[i] & (1<<j)) == 0) System.out.print(i*radix+j);
        }
    }
}

情形二:10mb内存= 10 * 106 * 8bits = 8000万bits

Solution: For all possible 16-bit prefixes, there are 216 number of integers = 65536, we need 216 * 4 * 8 = 2 million bits. We need build 65536 buckets. For each bucket, we need 4 bytes holding all possibilities because the worst case is all the 4 billion integers belong to the same bucket. Build the counter of each bucket through the first pass through the file. Scan the buckets, find the first one who has less than 65536 hit. Build new buckets whose high 16-bit prefixes are we found in step2 through second pass of the file Scan the buckets built in step3, find the first bucket which doesnt have a hit. The code is very similar to above one.

结论: 我们通过增加文件传递来减少内存。


为那些迟到的人澄清一下:这个问题,正如所问的,并不是说有一个整数没有包含在文件中——至少大多数人不是这么理解的。不过,评论线程中的许多评论都是关于任务的变化。不幸的是,把它引入评论线程的评论后来被它的作者删除了,所以现在看起来它的孤儿回复只是误解了一切。这很让人困惑,抱歉。


当前回答

为了完整起见,这里有另一个非常简单的解决方案,它很可能需要很长时间才能运行,但只使用很少的内存。

设所有可能的整数为从int_min到int_max的范围,和 bool isNotInFile(integer)一个函数,如果文件不包含某个整数,则返回true,否则返回false(通过将该特定整数与文件中的每个整数进行比较)

for (integer i = int_min; i <= int_max; ++i)
{
    if (isNotInFile(i)) {
        return i;
    }
}

其他回答

为了完整起见,这里有另一个非常简单的解决方案,它很可能需要很长时间才能运行,但只使用很少的内存。

设所有可能的整数为从int_min到int_max的范围,和 bool isNotInFile(integer)一个函数,如果文件不包含某个整数,则返回true,否则返回false(通过将该特定整数与文件中的每个整数进行比较)

for (integer i = int_min; i <= int_max; ++i)
{
    if (isNotInFile(i)) {
        return i;
    }
}

出于某种原因,当我读到这个问题时,我想到了对角化。假设是任意大的整数。

Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion bit number one bit at a time, and that number will not be the same as any in the file. Proof: it were the same as the nth number, then they would agree on the nth bit, but they don't by construction.

你不需要对它们排序,只需要重复划分它们的子集。

The first step is like the first pass of a quicksort. Pick one of the integers, x, and using it make a pass through the array to put all the values less than x to its left and values more than x to its right. Find which side of x has the greatest number of available slots (integers not in the list). This is easily computable by comparing the value of x with its position. Then repeat the partition on the sub-list on that side of x. Then repeat the partition on the sub-sub list with the greatest number of available integers, etc. Total number of compares to get down to an empty range should be about 4 billion, give or take.

一些消除

一种方法是消除比特,但这实际上可能不会产生结果(很可能不会)。Psuedocode:

long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)
foreach long fileVal in file
{
    val = val & ~fileVal;
    if (val == 0) error;
}

位计数

跟踪比特数;用最少的比特来产生一个值。同样,这也不能保证生成正确的值。

范围的逻辑

跟踪列表的顺序范围(按开始顺序)。范围由结构定义:

struct Range
{
  long Start, End; // Inclusive.
}
Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };

遍历文件中的每个值,并尝试将其从当前范围中删除。这个方法没有内存保证,但是它应该做得很好。

我可能读得太仔细了,但问题是“生成一个不包含在文件中的整数”。我只是对列表进行排序,并在最大的条目上加1。Bam,一个没有包含在文件中的整数。