My interview question was this:
Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB of memory. Follow up with what you would do if you only have 10 MB of memory.
My analysis:
The file size is 4 × 10⁹ × 4 bytes = 16 GB.
We can do external sorting, thus letting us know the range of the integers.
My question is: what is the best way to detect the missing integer in the sorted big integer sets?
My understanding (after reading all the answers):
Assuming we are talking about 32-bit integers, there are 2³² ≈ 4 × 10⁹ distinct integers.
Case 1: we have 1 GB = 10⁹ bytes × 8 bits/byte = 8 billion bits of memory.
Solution:
If we use one bit to represent one distinct integer, it is enough. We don't even need to sort.
Implementation:
import java.io.FileReader;
import java.io.FileNotFoundException;
import java.util.Scanner;

int radix = 8;
// 2^32 bits / 8 bits per byte = 512 MB, which fits in 1 GB.
byte[] bitfield = new byte[(int) ((1L << 32) / radix)];

void F() throws FileNotFoundException {
    Scanner in = new Scanner(new FileReader("a.txt"));
    while (in.hasNextInt()) {
        // Treat each value as an unsigned 32-bit integer in 0..2^32-1.
        long n = in.nextInt() & 0xFFFFFFFFL;
        bitfield[(int) (n / radix)] |= (1 << (int) (n % radix));
    }
    in.close();
    // Any bit still clear marks an integer that was not in the file.
    for (int i = 0; i < bitfield.length; i++) {
        for (int j = 0; j < radix; j++) {
            if ((bitfield[i] & (1 << j)) == 0) System.out.println((long) i * radix + j);
        }
    }
}
Case 2: 10 MB of memory = 10 × 10⁶ bytes × 8 bits/byte = 80 million bits.
Solution:
For all possible 16-bit prefixes, there are 2¹⁶ = 65536 of them, so we need 2¹⁶ × 4 bytes × 8 bits/byte = 2 million bits for the counters. We need to build 65536 buckets. For each bucket, we need 4 bytes to hold all possibilities, because the worst case is that all 4 billion integers belong to the same bucket.
1. Build the counter for each bucket through a first pass over the file.
2. Scan the buckets and find the first one that has fewer than 65536 hits.
3. Build new buckets for the high 16-bit prefix found in step 2 through a second pass over the file.
4. Scan the buckets built in step 3 and find the first bucket which doesn't have a hit.
The code is very similar to the one above.
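A minimal Java sketch of this two-pass approach, assuming the values are read with Scanner as in the code above and treated as unsigned 32-bit integers (the file name "a.txt" is a placeholder):

import java.io.FileReader;
import java.io.FileNotFoundException;
import java.util.Scanner;

class TwoPassBuckets {
    public static void main(String[] args) throws FileNotFoundException {
        // First pass: count how many values fall into each of the 2^16 buckets
        // keyed by the high 16 bits (65536 counters * 8 bytes = 512 KB).
        long[] counts = new long[1 << 16];
        Scanner in = new Scanner(new FileReader("a.txt"));
        while (in.hasNextInt()) {
            long n = in.nextInt() & 0xFFFFFFFFL;   // value as unsigned 0..2^32-1
            counts[(int) (n >>> 16)]++;
        }
        in.close();

        // A bucket with fewer than 65536 hits must be missing at least one value.
        int bucket = 0;
        while (counts[bucket] >= (1 << 16)) bucket++;

        // Second pass: mark which low 16-bit values occur inside that bucket (64 KB).
        boolean[] seen = new boolean[1 << 16];
        in = new Scanner(new FileReader("a.txt"));
        while (in.hasNextInt()) {
            long n = in.nextInt() & 0xFFFFFFFFL;
            if ((int) (n >>> 16) == bucket) seen[(int) (n & 0xFFFF)] = true;
        }
        in.close();

        for (int low = 0; low < (1 << 16); low++) {
            if (!seen[low]) {
                System.out.println(((long) bucket << 16) | low);  // an integer not in the file
                break;
            }
        }
    }
}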
Conclusion:
We reduce memory by increasing the number of passes over the file.
A clarification for those arriving late: the question, as asked, does not say that there is exactly one integer that is not contained in the file — at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.
The simplest approach is to find the minimum number in the file, and return 1 less than that. This uses O(1) storage, and O(n) time for a file of n numbers. However, it will fail if the number range is limited, which could make min − 1 not a representable number.
The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.
A 2-pass method with 2^16 counting-buckets has also been mentioned. It reads 2*n integers, so uses O(n) time and O(1) storage, but it cannot handle ranges of more than 2^32 numbers. However, it's easily extended to (e.g.) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to using tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case run time is no longer O(n) but instead is O(n*log n).
The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell, answers the question asked in stackoverflow#35185, as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a distinct number. Rationale: given ~4G distinct numbers XOR'd together, and ca. 300M not in the file, the number of set bits in each bit position has equal chance of being odd or even. Thus, 2^32 numbers have equal likelihood of arising as the XOR result, of which 93% are already in the file. Note that if the numbers in the file aren't all distinct, the XOR method's probability of success rises.
OK, this is not quite thought through, as it assumes that the integers in the file follow some static distribution. Apparently they don't need to, but even then one should try this:
There are ≈ 4.3 billion 32-bit integers. We don't know how they are distributed in the file, but the worst case is the one with the highest Shannon entropy: an equal distribution. In this case the probability of any one integer not occurring in the file is
((2³² − 1)/2³²)^(4·10⁹) ≈ 0.4
The lower the Shannon entropy, the higher this probability gets on the average, but even for this worst case we have a chance of 90% to find a nonoccurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator, store them in a list. Then read int after int and compare it to all of your guesses. When there's a match, remove this list entry. After having been through all of the file, chances are you will have more than one guess left. Use any of them. In the rare (10% even at worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10->99%).
Memory consumption: a few dozen bytes. Complexity: O(n). Overhead: negligible, since most of the time will be spent on the unavoidable hard disk accesses rather than on comparing ints anyway.
The actual worst case, when we do not assume a static distribution, is that every integer occurs at most once, because then only
1 − 4000000000/2³² ≈ 6%
of all integers don't occur in the file. So you'll need some more guesses, but that still won't consume painful amounts of memory.
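A minimal Java sketch of this guessing approach (the file name "a.txt" and the candidate count of 16 are placeholder choices; values are read with Scanner as in the earlier code):

import java.io.FileReader;
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.Random;
import java.util.Scanner;

class RandomGuess {
    public static void main(String[] args) throws FileNotFoundException {
        // Draw a handful of random candidate integers.
        Random rng = new Random();
        HashSet<Integer> guesses = new HashSet<>();
        while (guesses.size() < 16) guesses.add(rng.nextInt());

        // Strike out every candidate that actually appears in the file.
        Scanner in = new Scanner(new FileReader("a.txt"));
        while (in.hasNextInt() && !guesses.isEmpty()) {
            guesses.remove(in.nextInt());
        }
        in.close();

        if (guesses.isEmpty()) {
            System.out.println("All guesses occurred; retry with more candidates.");
        } else {
            System.out.println("Not in the file: " + guesses.iterator().next());
        }
    }
}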
A detailed discussion of this problem can be found in Jon Bentley's "Column 1", Programming Pearls, Addison-Wesley, pp. 3-10.
Bentley discusses several approaches, including external sort, merge sort using several external files, etc., but the best method Bentley suggests is a single-pass algorithm using bit fields, which he humorously calls "Wonder Sort" :)
Coming to the problem, 4 billion numbers can be represented as:
4 billion bits = (4000000000 / 8) bytes = about 0.466 GB
The code to implement the bitset is simple (taken from the solutions page):
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array, and then examines this array using the test macro above to find the missing number.
If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on the available memory. To take a very simple example, if only 1 byte (i.e. memory to handle 8 numbers) were available and the range were from 0 to 31, we divide this into the ranges 0 to 7, 8-15, 16-23, 24-31 and handle each range in one of 32/8 = 4 passes.
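A minimal Java sketch of the k-pass idea under tight memory, assuming 32-bit values treated as unsigned and a placeholder file "a.txt"; the bitmap size per pass (here 2²³ bits = 1 MB, giving 512 passes) is an arbitrary choice:

import java.io.FileReader;
import java.io.FileNotFoundException;
import java.util.BitSet;
import java.util.Scanner;

class KPass {
    static final int RANGE = 1 << 23;   // bits of bitmap per pass (1 MB)

    public static void main(String[] args) throws FileNotFoundException {
        long passes = (1L << 32) / RANGE;
        for (long pass = 0; pass < passes; pass++) {
            long lo = pass * RANGE;               // current range is [lo, lo + RANGE)
            BitSet seen = new BitSet(RANGE);
            Scanner in = new Scanner(new FileReader("a.txt"));
            while (in.hasNextInt()) {
                long n = in.nextInt() & 0xFFFFFFFFL;
                if (n >= lo && n < lo + RANGE) seen.set((int) (n - lo));
            }
            in.close();
            int gap = seen.nextClearBit(0);       // first unset bit; RANGE if the range is full
            if (gap < RANGE) {
                System.out.println(lo + gap);     // an integer not in the file
                return;
            }
        }
    }
}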
HTH.
If they are 32-bit integers (likely from the choice of ~4 billion numbers close to 2³²), your list of 4 billion numbers will take up at most 93% of the possible integers (4 × 10⁹ / 2³²). So if you create a bit-array of 2³² bits with each bit initialized to zero (which will take up 2²⁹ bytes ~ 500 MB of RAM; remember a byte = 2³ bits = 8 bits), read through your integer list and for each int set the corresponding bit-array element from 0 to 1; and then read through your bit-array and return the first bit that's still 0.
In the case where you have less RAM (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit-array for all numbers between 0 and 83886079. So you could read through your list of ints and only record #s that are between 0 and 83886079 in your bit array. If the numbers are randomly distributed, then with overwhelming probability (it differs from 100% by about 10⁻²⁵⁹²⁰⁶⁹) you will find a missing int. In fact, if you only choose numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.
But let's say instead of having about 4 billion numbers, you had 2³² − 1 numbers and less than 10 MB of RAM; then any small range of ints only has a small possibility of not containing your missing number.
If every int in the list were guaranteed to be unique, you could sum the numbers and subtract the sum from the full sum (½)(2³²)(2³² − 1) = 9223372034707292160 to find the missing int. However, if an int occurred twice this method will fail.
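A minimal Java sketch of that summing trick, assuming the file holds every value in 0..2³²−1 exactly once except the single missing one (note the full sum still fits in a signed 64-bit long; "a.txt" is a placeholder):

import java.io.FileReader;
import java.io.FileNotFoundException;
import java.util.Scanner;

class SumMethod {
    public static void main(String[] args) throws FileNotFoundException {
        // Full sum of 0..2^32-1 = (1/2)(2^32)(2^32 - 1) = 9223372034707292160.
        long fullSum = (1L << 31) * ((1L << 32) - 1);
        long fileSum = 0;
        Scanner in = new Scanner(new FileReader("a.txt"));
        while (in.hasNextInt()) {
            fileSum += in.nextInt() & 0xFFFFFFFFL;   // each value treated as unsigned
        }
        in.close();
        System.out.println("Missing: " + (fullSum - fileSum));
    }
}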
However, you can always divide and conquer. A naive method would be to read through the array and count the number of numbers that are in the first half (0 to 2³¹−1) and second half (2³¹ to 2³²−1). Then pick the range with fewer numbers and repeat dividing that range in half. (Say if there were fewer numbers in (2³¹, 2³²), then your next search would count the numbers in the ranges (2³¹, 3·2³⁰−1) and (3·2³⁰, 2³²).) Keep repeating until you find a range with zero numbers and you have your answer. Should take O(lg N) ~ 32 reads through the array.
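A minimal Java sketch of this halving strategy (same unsigned-int reading convention and placeholder file name as above); each iteration of the outer loop is one full pass over the file:

import java.io.FileReader;
import java.io.FileNotFoundException;
import java.util.Scanner;

class HalveRange {
    public static void main(String[] args) throws FileNotFoundException {
        long lo = 0, hi = 1L << 32;               // candidate range [lo, hi)
        while (hi - lo > 1) {
            long mid = lo + (hi - lo) / 2;
            long lowerCount = 0;
            Scanner in = new Scanner(new FileReader("a.txt"));
            while (in.hasNextInt()) {
                long n = in.nextInt() & 0xFFFFFFFFL;
                if (n >= lo && n < mid) lowerCount++;
            }
            in.close();
            // Keep the half that holds fewer values than it has slots,
            // since that half is guaranteed to be missing something.
            if (lowerCount < mid - lo) hi = mid;
            else lo = mid;
        }
        System.out.println("Missing: " + lo);     // a one-value range with zero occurrences
    }
}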
This approach is very inefficient, as we only use two integers' worth of memory at each step (about 8 bytes of RAM, with a 4-byte (32-bit) integer each). A better method is to divide into sqrt(2³²) = 2¹⁶ = 65536 bins, each covering 65536 numbers. Each bin needs 4 bytes to store its count, so you need 2¹⁸ bytes = 256 kB. So bin 0 is (0 to 65535 = 2¹⁶−1), bin 1 is (2¹⁶ = 65536 to 2·2¹⁶−1 = 131071), bin 2 is (2·2¹⁶ = 131072 to 3·2¹⁶−1 = 196607), and so on. In Python you'd have something like:
import numpy as np
nums_in_bin = np.zeros(65536, dtype=np.uint32)
for N in four_billion_int_array:
    nums_in_bin[N // 65536] += 1
for bin_num, bin_count in enumerate(nums_in_bin):
    if bin_count < 65536:
        break  # we have found an incomplete bin with missing ints (bin_num)
Read through the ~4 billion integer list, count how many ints fall in each of the 2¹⁶ bins, and find an incomplete_bin that doesn't have all 65536 numbers. Then you read through the 4 billion integer list again, but this time only take notice when integers are in that range, flipping a bit when you find them.
del nums_in_bin  # allow gc to free old 256 kB array
from bitarray import bitarray
my_bit_array = bitarray(65536)  # 65536 bits = 8 kB
my_bit_array.setall(0)
for N in four_billion_int_array:
    if N // 65536 == bin_num:
        my_bit_array[N % 65536] = 1
for i, bit in enumerate(my_bit_array):
    if not bit:
        print(bin_num * 65536 + i)
        break
Some eliminations
One way is to eliminate bits, but this might not actually yield a result (chances are it won't). Pseudocode:
long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)
foreach long fileVal in file
{
    val = val & ~fileVal;
    if (val == 0) error;
}
Bit counting
Keep track of how many times each bit is set. Then use the bits that were set the fewest times to generate a candidate value. Again, this has no guarantee of generating a correct value.
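A minimal Java sketch of this bit-counting heuristic (same reading convention and placeholder file name as the code above); as noted, the candidate is not guaranteed to be absent from the file:

import java.io.FileReader;
import java.io.FileNotFoundException;
import java.util.Scanner;

class BitCounting {
    public static void main(String[] args) throws FileNotFoundException {
        long[] setCount = new long[32];
        long total = 0;
        Scanner in = new Scanner(new FileReader("a.txt"));
        while (in.hasNextInt()) {
            int n = in.nextInt();
            total++;
            for (int b = 0; b < 32; b++) {
                if ((n & (1 << b)) != 0) setCount[b]++;   // how often bit b was set
            }
        }
        in.close();
        int candidate = 0;
        for (int b = 0; b < 32; b++) {
            // Choose 1 for this bit only if 1 was the rarer choice in the file.
            if (setCount[b] < total - setCount[b]) candidate |= (1 << b);
        }
        System.out.println("Candidate (not guaranteed to be missing): " + candidate);
    }
}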
Range logic
Keep track of an ordered list of ranges (ordered by start). A range is defined by the structure:
struct Range
{
    long Start, End; // Inclusive.
}
Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };
Go through each value in the file and try to remove it from the current ranges. This method has no memory guarantees, but it should do fairly well.
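A minimal Java sketch of this range bookkeeping, using a TreeMap rather than the struct above and shown on the 32-bit range for illustration (memory use depends on how fragmented the ranges become):

import java.util.Map;
import java.util.TreeMap;

class RangeLogic {
    // Inclusive ranges of values not seen yet, keyed by range start.
    private final TreeMap<Long, Long> free = new TreeMap<>();

    RangeLogic(long start, long end) { free.put(start, end); }

    void remove(long value) {
        Map.Entry<Long, Long> e = free.floorEntry(value);
        if (e == null || e.getValue() < value) return;   // value already removed
        long start = e.getKey(), end = e.getValue();
        free.remove(start);
        if (value > start) free.put(start, value - 1);   // surviving left piece
        if (value < end) free.put(value + 1, end);       // surviving right piece
    }

    Long anyMissing() {
        return free.isEmpty() ? null : free.firstKey();  // start of any surviving range
    }

    public static void main(String[] args) {
        RangeLogic r = new RangeLogic(0, (1L << 32) - 1);
        for (long v : new long[] {0, 1, 3, 4}) r.remove(v);  // stand-in for reading the file
        System.out.println("Missing: " + r.anyMissing());    // prints 2
    }
}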
I think this is a solved problem (see above), but there's an interesting side case to keep in mind because it might get asked:
If there are exactly 4,294,967,295 (2^32 − 1) 32-bit integers with no repeats, and therefore only one is missing, there is a simple solution.
Start a running total at zero and, for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.
The "only one missing integer" problem can be solved with a single pass and only 64 bits of RAM dedicated to the data (32 for the running total, 32 for reading in the next integer).
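A minimal Java sketch of that running-total trick, assuming the file really does contain 2^32 − 1 distinct unsigned 32-bit values ("a.txt" is a placeholder); the final % handles the corner case where the missing value is 0:

import java.io.FileReader;
import java.io.FileNotFoundException;
import java.util.Scanner;

class RunningTotal {
    public static void main(String[] args) throws FileNotFoundException {
        long mod = 1L << 32;                               // 4294967296
        long runningTotal = 0;
        Scanner in = new Scanner(new FileReader("a.txt"));
        while (in.hasNextInt()) {
            long n = in.nextInt() & 0xFFFFFFFFL;           // value as unsigned
            runningTotal = (runningTotal + n) % mod;       // add with 32-bit overflow
        }
        in.close();
        runningTotal = (runningTotal + mod / 2) % mod;     // add 2^31 with overflow
        System.out.println("Missing: " + (mod - runningTotal) % mod);
    }
}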
Corollary: the more general specification is extremely easy to satisfy if we don't care how many bits the resulting integer must have. We just generate an integer big enough that it cannot possibly be contained in the file we are given. Again, this takes up an absolute minimum of RAM. See the pseudocode.
# Grab the file size
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);
# Print a '2' for every bit of the file.
for (c=0; c<sz; c++) {
    for (b=0; b<4; b++) {
        print "2";
    }
}
As long as we assume the range of numbers is always 2^n (an even power of 2), the exclusive-or will work (as shown by another poster). As far as why, let's prove it:
The Theory
Given any 0-based range of integers that has 2^n elements with one element missing, you can find the missing element by simply xor-ing the known values together to yield the missing number.
The Proof
Let's look at n = 2. For n = 2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:
0-00
1-01
2-10
3-11
Now, if we look carefully, each bit is set exactly twice. Therefore, since each bit is set an even number of times, the exclusive-or of all the numbers yields 0. If a single number is missing, the exclusive-or will yield a number that, when exclusive-ored with the missing number, results in 0. Therefore, the missing number and the resulting exclusive-or number are exactly the same. If we remove 2, the resulting xor will be 10 (or 2).
Now let's look at n+1. Let's call the number of times each bit is set in n, x, and the number of times each bit is set in n+1, y. The value of y will be equal to y = x * 2, because there are x elements with the n+1 bit set to 0, and x elements with the n+1 bit set to 1. And since 2x will always be even, n+1 will always have each bit set an even number of times.
Therefore, since n = 2 works and n+1 works, the xor method will work for all values of n >= 2.
The Algorithm For 0-Based Ranges
This is quite simple. It uses 2*n bits of memory, so for any range <= 32 bits, two 32-bit integers will work (ignoring any memory consumed by the file descriptor). And it makes a single pass over the file.
long supplied = 0;
long result = 0;
while (supplied = read_int_from_file()) {
    result = result ^ supplied;
}
return result;
The Algorithm For Arbitrary Based Ranges
This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n... This basically re-bases the range to have the minimum at 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).
long supplied = 0;
long result = 0;
long offset = INT_MAX;
while (supplied = read_int_from_file()) {
    if (supplied < offset) {
        offset = supplied;
    }
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
    result = result ^ (supplied - offset);
}
return result + offset;
Arbitrary Ranges
We can apply this modified method to a set of arbitrary ranges, since all ranges will cross a power of 2^n at least once. This works only if there is a single missing bit. It takes 2 passes of an unsorted file, but it will find the single missing number every time:
long supplied = 0;
long result = 0;
long offset = INT_MAX;
long n = 0;
double temp;
while (supplied = read_int_from_file()) {
    if (supplied < offset) {
        offset = supplied;
    }
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
    n++;
    result = result ^ (supplied - offset);
}
// We need to increment n one value so that we take care of the missing
// int value
n++;
while (n == 1 || 0 != (n & (n - 1))) {
    result = result ^ (n++);
}
return result + offset;
Basically, it re-bases the range around 0. Then it counts the number of unsorted values as it computes the exclusive-or. Then, it adds 1 to the count of unsorted values to take care of the missing value (count the missing one). Then it keeps xor-ing the value of n, incremented by 1 each time, until n is a power of 2. The result is then re-based back to the original base. Done.
Here's the algorithm I tested in PHP (using an array instead of a file, but the same concept):
function find($array) {
    $offset = min($array);
    $n = 0;
    $result = 0;
    foreach ($array as $value) {
        $result = $result ^ ($value - $offset);
        $n++;
    }
    $n++; // This takes care of the missing value
    while ($n == 1 || 0 != ($n & ($n - 1))) {
        $result = $result ^ ($n++);
    }
    return $result + $offset;
}
Fed an array with any range of values (I tested it including negatives) with one value inside that range missing, it found the correct value every time.
Another Approach
Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to running this algorithm:
long supplied = 0;
long last = read_int_from_file();
while (supplied = read_int_from_file()) {
    if (supplied != last + 1) {
        return last + 1;
    }
    last = supplied;
}
// The range is contiguous, so what do we do here? Let's return last + 1:
return last + 1;
Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GiB of memory. Follow up with what you would do if you have only 10 MiB of memory.
The file size is 4 × 10⁹ × 4 bytes = 16 GiB.
In case they are 32-bit unsigned integers:
0 <= Number < 2^32
0 <= Number < 4,294,967,296
My proposed solution: C++ without error checking.
#include <vector>
#include <fstream>
#include <iostream>
using namespace std;

int main ()
{
    const long long SIZE = 1LL << 32;       // 2^32 possible values
    std::vector<bool> checker(SIZE, false); // 2^32 bits = 512 MB
    std::ifstream infile("file.txt");       // TODO: error checking
    unsigned int num = 0;
    while (infile >> num)
    {
        checker[num] = true;
    }
    infile.close();
    // print missing numbers
    for (long long i = 0; i < SIZE; i++)
    {
        if (!checker[i])
            cout << i << endl;
    }
    return 0;
}
Complexity
Space ~ 2³² bits = 2²⁹ bytes = 2¹⁹ KB = 2⁹ MB = ½ GB
Time ~ a single pass
Completeness ~ yes