Generate an integer that is not among four billion given ones

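I was given this interview question:

Given an input file with four billion integers, provide an algorithm to generate an integer that is not contained in the file. Assume you have 1 GB of memory. Follow up with what you would do if you had only 10 MB of memory.

My analysis:

The size of the file is 4×10^9×4 bytes = 16 GB.

We can do an external sort, which tells us the range of the integers.

My question is: what is the best way to detect the missing integer in a large sorted set of integers?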

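My understanding (after reading all the answers):

Assuming we are talking about 32-bit integers, there are 2^32 ≈ 4*10^9 distinct integers.

Case 1: we have 1 GB = 1 * 10^9 * 8 bits = 8 billion bits of memory.

Solution:

If we use one bit to represent each distinct integer, that is enough. We don't need to sort.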
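As a quick sanity check of the arithmetic above, here is a small Python sketch. The constant names are illustrative only, and it assumes 32-bit integers and decimal units (1 GB = 10^9 bytes), as in the question.

```python
# Back-of-the-envelope check; assumes 32-bit integers and 1 GB = 10**9 bytes.
DISTINCT_VALUES = 2 ** 32            # number of distinct 32-bit integers
BITMAP_BYTES = DISTINCT_VALUES // 8  # one bit per possible value
AVAILABLE_BYTES = 10 ** 9            # the 1 GB budget from the question

print(BITMAP_BYTES)                     # size of the bitmap in bytes (2**29)
print(BITMAP_BYTES <= AVAILABLE_BYTES)  # whether the bitmap fits in the budget
```

The bitmap needs 2^29 bytes (512 MiB), so it fits in the 1 GB budget with room to spare.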

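Implementation:

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

int radix = 8;
// 2^32 bits = 2^29 bytes. 0xffffffff is -1 as an int, so the size is written out explicitly.
byte[] bitfield = new byte[1 << 29];
void F() throws FileNotFoundException{
    Scanner in = new Scanner(new FileReader("a.txt"));
    while(in.hasNextInt()){
        // Reinterpret the value as unsigned so that negative ints map to valid indexes.
        long n = in.nextInt() & 0xffffffffL;
        bitfield[(int)(n/radix)] |= (1 << (n%radix));
    }

    for(int i = 0; i< bitfield.length; i++){
        for(int j =0; j<radix; j++){
            if( (bitfield[i] & (1<<j)) == 0) {
                // Print the first missing value (as a signed int) and stop.
                System.out.println((int)((long)i*radix+j));
                return;
            }
        }
    }
}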

Case 2: 10 MB of memory = 10 * 10^6 * 8 bits = 80 million bits

Solution: There are 2^16 = 65536 possible 16-bit prefixes, so we build 65536 buckets, one per prefix. Each bucket needs a 4-byte counter, because in the worst case all 4 billion integers fall into the same bucket. That is 2^16 * 4 * 8 = about 2 million bits.

1. Build the counter of each bucket in a first pass through the file.
2. Scan the buckets and find the first one that has fewer than 65536 hits.
3. In a second pass through the file, build new buckets for the integers whose high 16-bit prefix is the one found in step 2.
4. Scan the buckets built in step 3 and find the first one that has no hit.

The code is very similar to the one above.
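The two passes could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `find_missing_two_pass` and `read_values` are hypothetical names, `read_values` is assumed to return a fresh iterator over the file's integers on every call, and values are assumed to be unsigned and smaller than `2**bits`. The `bits` parameter exists only so the idea can be tried on small ranges.

```python
from typing import Callable, Iterable, Optional


def find_missing_two_pass(read_values: Callable[[], Iterable[int]],
                          bits: int = 32) -> Optional[int]:
    """Return one unsigned `bits`-bit value absent from the input, or None."""
    half = bits // 2
    low_bits = bits - half
    bucket_size = 1 << low_bits

    # Pass 1: count how many values fall under each high-order prefix.
    counts = [0] * (1 << half)
    for v in read_values():
        counts[v >> low_bits] += 1

    # A bucket with fewer hits than its capacity must contain a gap.
    prefix = next((p for p, c in enumerate(counts) if c < bucket_size), None)
    if prefix is None:
        return None  # no bucket is provably short (full input or duplicates)

    # Pass 2: mark which low-order suffixes occur under that prefix.
    seen = bytearray(bucket_size)  # one byte per suffix, for simplicity
    for v in read_values():
        if v >> low_bits == prefix:
            seen[v & (bucket_size - 1)] = 1

    suffix = seen.index(0)  # exists because the bucket had < bucket_size hits
    return (prefix << low_bits) | suffix
```

A Python list of counters uses more than 4 bytes per entry, so this shows the logic rather than the exact memory budget; a packed array of 32-bit counters would be closer to the figures above.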

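Conclusion: We reduce memory use by making more passes over the file.

A clarification for those arriving late: the question, as asked, does not say that exactly one integer is missing from the file. At least, that is not how most people read it. Many comments in the comment thread are about that variation of the task, though. Unfortunately, the comment that introduced it to the thread was later deleted by its author, so the orphaned replies to it now look as if they misunderstood everything. It's very confusing, sorry.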

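Current answer

Why make it so complicated? You are asking for an integer that is not present in the file?

According to the rules as specified, the only thing you need to store is the largest integer encountered so far in the file. Once the entire file has been read, return that number plus 1.

There is no risk of hitting maxint or anything like that, because according to the rules there is no restriction on the size of the integer or number returned by the algorithm.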
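A minimal sketch of this answer in Python, whose integers are unbounded and therefore match the "no size limit" reading of the rules. The function name `one_past_max` is hypothetical, and returning 0 for an empty input is an assumption of this sketch.

```python
from typing import Iterable


def one_past_max(values: Iterable[int]) -> int:
    """Return an integer larger than every value seen, using O(1) memory."""
    largest = None
    for v in values:
        if largest is None or v > largest:
            largest = v
    # Assumption: for an empty input any integer qualifies, so return 0.
    return 0 if largest is None else largest + 1
```

For example, `one_past_max([3, 7, 5])` yields 8, and an input containing 2^31 - 1 yields 2^31, which no longer fits in a signed 32-bit integer.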

2011-08-23 14:38:13

Other answers

If exactly one integer is missing from the range [0, 2^x - 1], just XOR all of them together. For example:

>>> 0 ^ 1 ^ 3
2
>>> 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 6 ^ 7
5

(I know this doesn't answer the question exactly, but it is a good answer to a very similar question.)
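The XOR trick can be wrapped in a small function, sketched here in Python. The name `missing_by_xor` is hypothetical, and the sketch assumes the input holds every value in [0, 2^x - 1] exactly once except for one missing value. It also XORs in the full range explicitly: for x >= 2 that term is 0, which is why the plain examples above work, but for x = 1 it is not.

```python
from functools import reduce
from operator import xor
from typing import Iterable


def missing_by_xor(values: Iterable[int], x: int) -> int:
    """Return the single value of [0, 2**x - 1] that is absent from `values`."""
    full_range = reduce(xor, range(1 << x), 0)  # equals 0 whenever x >= 2
    present = reduce(xor, values, 0)
    return full_range ^ present
```

For a 32-bit input, the `full_range` term could simply be replaced by 0 to avoid iterating over the whole range.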

2011-08-24 02:43:07

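You can use bit flags to mark whether each integer is present.

After traversing the entire file, scan the bits to find a number that is not present.

Assuming each integer is 32 bits, the flags (2^32 bits = 512 MB) fit comfortably in 1 GB of RAM.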
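A rough Python sketch of the bit-flag idea, using a `bytearray` as the bit field. The function name `first_unset_value` and the `bits` parameter are hypothetical; `bits` lets the idea be tried on a small range, since the default of 32 allocates 512 MB. Values are assumed to be unsigned and smaller than `2**bits`.

```python
from typing import Iterable, Optional


def first_unset_value(values: Iterable[int], bits: int = 32) -> Optional[int]:
    """Return the smallest value in [0, 2**bits) not present, or None."""
    total = 1 << bits
    flags = bytearray((total + 7) // 8)  # one bit per possible value

    # Mark every value that occurs in the input.
    for v in values:
        flags[v >> 3] |= 1 << (v & 7)

    # Scan for the first bit that is still clear.
    for byte_index, byte in enumerate(flags):
        if byte != 0xFF:
            for bit in range(8):
                candidate = byte_index * 8 + bit
                if candidate < total and not byte & (1 << bit):
                    return candidate
    return None
```

Duplicates in the input are harmless here, because setting a bit twice leaves it set.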

2011-08-22 21:18:40

You don't need to sort them, just repeatedly partition subsets of them.

The first step is like the first pass of a quicksort. Pick one of the integers, x, and using it make a pass through the array to put all the values less than x to its left and values more than x to its right. Find which side of x has the greatest number of available slots (integers not in the list). This is easily computable by comparing the value of x with its position. Then repeat the partition on the sub-list on that side of x. Then repeat the partition on the sub-sub list with the greatest number of available integers, etc. Total number of compares to get down to an empty range should be about 4 billion, give or take.
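The partitioning idea above might look like this in Python. It is a sketch under assumptions: the name `missing_by_partition` is hypothetical, the values are assumed to be distinct and to lie in [lo, hi), and it builds new lists for clarity instead of partitioning the array in place as a quicksort pass would.

```python
import random
from typing import List, Optional


def missing_by_partition(values: List[int], lo: int, hi: int) -> Optional[int]:
    """Return a value in [lo, hi) absent from `values` (assumed distinct)."""
    while values:
        if len(values) >= hi - lo:
            return None  # the range is completely filled
        pivot = random.choice(values)
        left = [v for v in values if v < pivot]
        right = [v for v in values if v > pivot]
        # Free slots = size of the sub-range minus the values that fall in it.
        free_left = (pivot - lo) - len(left)
        free_right = (hi - pivot - 1) - len(right)
        if free_left >= free_right:
            values, hi = left, pivot
        else:
            values, lo = right, pivot + 1
    # No values remain in the range, so its first element is free.
    return lo if lo < hi else None
```

Because the loop always descends into the side with more free slots, the range it ends on is guaranteed to be non-empty whenever the original range had a gap.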

2011-08-25 05:52:51

As the original question is currently worded, the simplest solution is:

Find the maximum value in the file, then add 1 to it.

2011-08-23 03:04:09

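I think this is a solved problem (see above), but there is an interesting side case to keep in mind because it might get asked:

If there are exactly 4,294,967,295 (2^32 - 1) 32-bit integers with no repeats, so that only one is missing, there is a simple solution.

Start a running total at zero, and for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.

The "only one missing integer" problem is solvable in a single pass, with only 64 bits of RAM dedicated to the data (32 for the running total, 32 to read in the next integer).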
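The running-total recipe can be sketched in Python as follows. The function name `missing_by_running_total` and the `bits` parameter are hypothetical, added so the arithmetic can be checked on a small range. The sketch assumes the input holds every unsigned value below `2**bits` exactly once, except for one.

```python
from typing import Iterable


def missing_by_running_total(values: Iterable[int], bits: int = 32) -> int:
    """Return the single missing value, assuming all others appear once."""
    modulus = 1 << bits  # 4294967296 for 32-bit integers
    running_total = 0
    for v in values:
        # Addition with wrap-around, as fixed-width hardware would do.
        running_total = (running_total + v) % modulus
    # The sum of the full range is congruent to modulus/2, so add that in.
    running_total = (running_total + modulus // 2) % modulus
    # The final modulo maps a result of `modulus` back to 0.
    return (modulus - running_total) % modulus
```

It works because the sum of all values from 0 to 2^bits - 1 is congruent to 2^bits / 2 modulo 2^bits, so the wrapped total differs from that by exactly the missing value.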

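Corollary: the more general specification is extremely simple to meet if we are not concerned with how many bits the integer result must have. We just generate an integer big enough that it cannot be contained in the file we are given. Again, this takes up absolutely minimal RAM. See the pseudocode.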

# Grab the file size
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);
# Print four '2' digits for every byte of the file.
for (c=0; c<sz; c++) {
  for (b=0; b<4; b++) {
    print "2";
  }
}

2011-08-24 10:37:54
