用1mb RAM对100万个8位数进行排序

我有一台有1mb内存的电脑，没有其他本地存储。我必须使用它通过TCP连接接受100万个8位十进制数字，对它们进行排序，然后通过另一个TCP连接发送排序的列表。

数字列表可能包含重复的，我不能丢弃。代码将放在ROM中，所以我不需要从1 MB中减去我的代码的大小。我已经有了驱动以太网端口和处理TCP/IP连接的代码，它需要2 KB的状态数据，包括1 KB的缓冲区，代码将通过它读取和写入数据。这个问题有解决办法吗?

问答来源:

slashdot.org

cleaton.net

当前回答

如果我们对这些数字一无所知，我们就会受到以下约束:

我们需要在排序之前加载所有的数字，这组数字是不可压缩的。

如果这些假设成立，则无法执行您的任务，因为您将需要至少26,575,425位的存储空间(3,321,929字节)。

你能跟我们说说你的数据吗?

2012-10-22 09:30:31

其他回答

我认为解决方案是结合视频编码的技术，即离散余弦变换。在数字视频中，不是将视频的亮度或颜色的变化记录为常规值，如110 112 115 116，而是从最后一个中减去每一个(类似于运行长度编码)。110 112 115 116变成110 2 3 1。这些值，2,3 1比原始值需要更少的比特。

So lets say we create a list of the input values as they arrive on the socket. We are storing in each element, not the value, but the offset of the one before it. We sort as we go, so the offsets are only going to be positive. But the offset could be 8 decimal digits wide which this fits in 3 bytes. Each element can't be 3 bytes, so we need to pack these. We could use the top bit of each byte as a "continue bit", indicating that the next byte is part of the number and the lower 7 bits of each byte need to be combined. zero is valid for duplicates.

当列表填满时，数字之间的距离应该越来越近，这意味着平均只有1个字节用于确定到下一个值的距离。7位值和1位偏移(如果方便的话)，但可能存在一个“继续”值需要少于8位的最佳点。

总之，我做了一些实验。我使用随机数生成器，我可以将100万个排序过的8位十进制数字放入大约1279000字节。每个数字之间的平均间隔始终是99…

public class Test {
    public static void main(String[] args) throws IOException {
        // 1 million values
        int[] values = new int[1000000];

        // create random values up to 8 digits lrong
        Random random = new Random();
        for (int x=0;x<values.length;x++) {
            values[x] = random.nextInt(100000000);
        }
        Arrays.sort(values);

        ByteArrayOutputStream baos = new ByteArrayOutputStream();

        int av = 0;    
        writeCompact(baos, values[0]);     // first value
        for (int x=1;x<values.length;x++) {
            int v = values[x] - values[x-1];  // difference
            av += v;
            System.out.println(values[x] + " diff " + v);
            writeCompact(baos, v);
        }

        System.out.println("Average offset " + (av/values.length));
        System.out.println("Fits in " + baos.toByteArray().length);
    }

    public static void writeCompact(OutputStream os, long value) throws IOException {
        do {
            int b = (int) value & 0x7f;
            value = (value & 0x7fffffffffffffffl) >> 7;
            os.write(value == 0 ? b : (b | 0x80));
        } while (value != 0);
    }
}

2012-10-22 08:33:28

如果输入流可以接收几次，这就容易多了(没有关于这方面的信息，想法和时间性能问题)。然后，我们可以数小数。有了计数值，就很容易生成输出流。通过计算值来压缩。这取决于输入流中的内容。

2012-10-20 22:33:55

我认为从组合学的角度来思考这个问题:有多少种可能的排序数字的组合?如果我们给出的组合是0,0,0 ....，0代码0，和0,0,0，…，1代码1，和999999999,99999999，…99999999是代码N, N是什么?换句话说，结果空间有多大?

Well, one way to think about this is noticing that this is a bijection of the problem of finding the number of monotonic paths in an N x M grid, where N = 1,000,000 and M = 100,000,000. In other words, if you have a grid that is 1,000,000 wide and 100,000,000 tall, how many shortest paths from the bottom left to the top right are there? Shortest paths of course require you only ever either move right or up (if you were to move down or left you would be undoing previously accomplished progress). To see how this is a bijection of our number sorting problem, observe the following:

您可以将路径中的任何水平支腿想象成排序中的一个数字，其中支腿的Y位置表示值。

所以如果路径只是向右移动一直到最后，然后一直跳到顶部，这相当于顺序为0,0,0，…，0。相反，如果它开始时一直跳到顶部，然后向右移动1,000,000次，这相当于999999999,99999999，……, 99999999。它向右移动一次，然后向上移动一次，然后向右移动一次，然后向上移动一次，等等，直到最后(然后必然会一直跳到顶部)，相当于0,1,2,3，…，999999。

幸运的是，这个问题已经解决了，这样的网格有(N + M)个选择(M)条路径:

(1,000,000 + 100,000,000)选择(100,000,000)~= 2.27 * 10^2436455

N因此等于2.27 * 10^2436455，因此代码0表示0,0,0，…，0和代码2.27 * 10^2436455，一些变化表示999999999,99999999，…, 99999999。

为了存储从0到2.27 * 10^2436455的所有数字，您需要lg2(2.27 * 10^2436455) = 8.0937 * 10^6位。

1兆字节= 8388608比特> 8093700比特

这样看来，我们至少有足够的空间来存储结果!当然，有趣的部分是在数字流进来时进行排序。不确定最好的方法是我们有294908位剩余。我想一个有趣的技巧是在每个点都假设这是整个排序，找到该排序的代码，然后当你收到一个新数字时，返回并更新之前的代码。手，手，手。

2012-10-21 22:46:38

现在的目标是一个实际的解决方案，覆盖所有可能的情况下，输入在8位数范围内，只有1MB的RAM。注:工作正在进行中，明天继续。使用对已排序整型的增量进行算术编码，对于1M个已排序整型，最坏的情况是每个条目花费大约7位(因为99999999/1000000是99，而log2(99)几乎是7位)。

但是你需要将1m个整数排序到7位或8位!级数越短，delta就越大，因此每个元素的比特数就越多。

我正在努力尽可能多地压缩(几乎)在原地。第一批接近250K的整数最多每个需要大约9位。因此结果大约需要275KB。重复使用剩余的空闲内存几次。然后解压缩-就地合并-压缩这些压缩块。这很难，但也是可能的。我认为。

合并后的列表将越来越接近每整数7位的目标。但是我不知道合并循环需要多少次迭代。也许3。

但是算术编码实现的不精确性可能使它不可能实现。如果这个问题是可能的，它将是非常紧张的。

有志愿者吗?

2012-10-21 23:12:29

请参阅第一个正确答案或后面带有算术编码的答案。下面你可能会发现一些有趣的，但不是100%防弹的解决方案。

这是一个非常有趣的任务，这里有另一个解决方案。我希望有人会觉得这个结果有用(或者至少有趣)。

阶段1:初始数据结构，粗略压缩方法，基本结果

Let's do some simple math: we have 1M (1048576 bytes) of RAM initially available to store 10^6 8 digit decimal numbers. [0;99999999]. So to store one number 27 bits are needed (taking the assumption that unsigned numbers will be used). Thus, to store a raw stream ~3.5M of RAM will be needed. Somebody already said it doesn't seem to be feasible, but I would say the task can be solved if the input is "good enough". Basically, the idea is to compress the input data with compression factor 0.29 or higher and do sorting in a proper manner.

让我们先解决压缩问题。有一些相关的测试已经可用:

http://www.theeggeadventure.com/wikimedia/index.php/Java_Data_Compression

“我运行了一个测试，压缩100万个连续整数使用各种形式的压缩。结果如下:

None     4000027
Deflate  2006803
Filtered 1391833
BZip2    427067
Lzma     255040

看起来LZMA (Lempel-Ziv-Markov链算法)是一个很好的选择。我准备了一个简单的PoC，但仍有一些细节需要强调:

Memory is limited so the idea is to presort numbers and use compressed buckets (dynamic size) as temporary storage It is easier to achieve a better compression factor with presorted data, so there is a static buffer for each bucket (numbers from the buffer are to be sorted before LZMA) Each bucket holds a specific range, so the final sort can be done for each bucket separately Bucket's size can be properly set, so there will be enough memory to decompress stored data and do the final sort for each bucket separately

请注意，所附的代码是一个POC，它不能用作最终解决方案，它只是演示了使用几个较小的缓冲区以某种最佳方式(可能是压缩)存储预排序数字的想法。LZMA并不是最终的解决方案。它被用作向这个PoC引入压缩的最快方法。

请看下面的PoC代码(请注意它只是一个演示，要编译它将需要LZMA-Java):

public class MemorySortDemo {

static final int NUM_COUNT = 1000000;
static final int NUM_MAX   = 100000000;

static final int BUCKETS      = 5;
static final int DICT_SIZE    = 16 * 1024; // LZMA dictionary size
static final int BUCKET_SIZE  = 1024;
static final int BUFFER_SIZE  = 10 * 1024;
static final int BUCKET_RANGE = NUM_MAX / BUCKETS;

static class Producer {
    private Random random = new Random();
    public int produce() { return random.nextInt(NUM_MAX); }
}

static class Bucket {
    public int size, pointer;
    public int[] buffer = new int[BUFFER_SIZE];

    public ByteArrayOutputStream tempOut = new ByteArrayOutputStream();
    public DataOutputStream tempDataOut = new DataOutputStream(tempOut);
    public ByteArrayOutputStream compressedOut = new ByteArrayOutputStream();

    public void submitBuffer() throws IOException {
        Arrays.sort(buffer, 0, pointer);

        for (int j = 0; j < pointer; j++) {
            tempDataOut.writeInt(buffer[j]);
            size++;
        }            
        pointer = 0;
    }

    public void write(int value) throws IOException {
        if (isBufferFull()) {
            submitBuffer();
        }
        buffer[pointer++] = value;
    }

    public boolean isBufferFull() {
        return pointer == BUFFER_SIZE;
    }

    public byte[] compressData() throws IOException {
        tempDataOut.close();
        return compress(tempOut.toByteArray());
    }        

    private byte[] compress(byte[] input) throws IOException {
        final BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(input));
        final DataOutputStream out = new DataOutputStream(new BufferedOutputStream(compressedOut));

        final Encoder encoder = new Encoder();
        encoder.setEndMarkerMode(true);
        encoder.setNumFastBytes(0x20);
        encoder.setDictionarySize(DICT_SIZE);
        encoder.setMatchFinder(Encoder.EMatchFinderTypeBT4);

        ByteArrayOutputStream encoderPrperties = new ByteArrayOutputStream();
        encoder.writeCoderProperties(encoderPrperties);
        encoderPrperties.flush();
        encoderPrperties.close();

        encoder.code(in, out, -1, -1, null);
        out.flush();
        out.close();
        in.close();

        return encoderPrperties.toByteArray();
    }

    public int[] decompress(byte[] properties) throws IOException {
        InputStream in = new ByteArrayInputStream(compressedOut.toByteArray());
        ByteArrayOutputStream data = new ByteArrayOutputStream(10 * 1024);
        BufferedOutputStream out = new BufferedOutputStream(data);

        Decoder decoder = new Decoder();
        decoder.setDecoderProperties(properties);
        decoder.code(in, out, 4 * size);

        out.flush();
        out.close();
        in.close();

        DataInputStream input = new DataInputStream(new ByteArrayInputStream(data.toByteArray()));
        int[] array = new int[size];
        for (int k = 0; k < size; k++) {
            array[k] = input.readInt();
        }

        return array;
    }
}

static class Sorter {
    private Bucket[] bucket = new Bucket[BUCKETS];

    public void doSort(Producer p, Consumer c) throws IOException {

        for (int i = 0; i < bucket.length; i++) {  // allocate buckets
            bucket[i] = new Bucket();
        }

        for(int i=0; i< NUM_COUNT; i++) {         // produce some data
            int value = p.produce();
            int bucketId = value/BUCKET_RANGE;
            bucket[bucketId].write(value);
            c.register(value);
        }

        for (int i = 0; i < bucket.length; i++) { // submit non-empty buffers
            bucket[i].submitBuffer();
        }

        byte[] compressProperties = null;
        for (int i = 0; i < bucket.length; i++) { // compress the data
            compressProperties = bucket[i].compressData();
        }

        printStatistics();

        for (int i = 0; i < bucket.length; i++) { // decode & sort buckets one by one
            int[] array = bucket[i].decompress(compressProperties);
            Arrays.sort(array);

            for(int v : array) {
                c.consume(v);
            }
        }
        c.finalCheck();
    }

    public void printStatistics() {
        int size = 0;
        int sizeCompressed = 0;

        for (int i = 0; i < BUCKETS; i++) {
            int bucketSize = 4*bucket[i].size;
            size += bucketSize;
            sizeCompressed += bucket[i].compressedOut.size();

            System.out.println("  bucket[" + i
                    + "] contains: " + bucket[i].size
                    + " numbers, compressed size: " + bucket[i].compressedOut.size()
                    + String.format(" compression factor: %.2f", ((double)bucket[i].compressedOut.size())/bucketSize));
        }

        System.out.println(String.format("Data size: %.2fM",(double)size/(1014*1024))
                + String.format(" compressed %.2fM",(double)sizeCompressed/(1014*1024))
                + String.format(" compression factor %.2f",(double)sizeCompressed/size));
    }
}

static class Consumer {
    private Set<Integer> values = new HashSet<>();

    int v = -1;
    public void consume(int value) {
        if(v < 0) v = value;

        if(v > value) {
            throw new IllegalArgumentException("Current value is greater than previous: " + v + " > " + value);
        }else{
            v = value;
            values.remove(value);
        }
    }

    public void register(int value) {
        values.add(value);
    }

    public void finalCheck() {
        System.out.println(values.size() > 0 ? "NOT OK: " + values.size() : "OK!");
    }
}

public static void main(String[] args) throws IOException {
    Producer p = new Producer();
    Consumer c = new Consumer();
    Sorter sorter = new Sorter();

    sorter.doSort(p, c);
}
}

对于随机数，它产生如下结果:

bucket[0] contains: 200357 numbers, compressed size: 353679 compression factor: 0.44
bucket[1] contains: 199465 numbers, compressed size: 352127 compression factor: 0.44
bucket[2] contains: 199682 numbers, compressed size: 352464 compression factor: 0.44
bucket[3] contains: 199949 numbers, compressed size: 352947 compression factor: 0.44
bucket[4] contains: 200547 numbers, compressed size: 353914 compression factor: 0.44
Data size: 3.85M compressed 1.70M compression factor 0.44

对于一个简单的升序序列(使用一个桶)，它产生:

bucket[0] contains: 1000000 numbers, compressed size: 256700 compression factor: 0.06
Data size: 3.85M compressed 0.25M compression factor 0.06

EDIT

结论:

不要试图欺骗大自然使用更简单的压缩和更低的内存占用确实需要一些额外的线索。普通的防弹方案似乎并不可行。

第二阶段:强化压缩，最终结论

正如在前一节中已经提到的，任何合适的压缩技术都可以使用。因此，让我们摒弃LZMA，转而采用更简单、更好(如果可能的话)的方法。有很多好的解决方案，包括算术编码，基树等。

无论如何，简单但有用的编码方案将比另一个外部库更能说明问题，它提供了一些漂亮的算法。实际的解决方案非常简单:因为存在部分排序的数据桶，所以可以使用增量而不是数字。

随机输入测试结果稍好:

bucket[0] contains: 10103 numbers, compressed size: 13683 compression factor: 0.34
bucket[1] contains: 9885 numbers, compressed size: 13479 compression factor: 0.34
...
bucket[98] contains: 10026 numbers, compressed size: 13612 compression factor: 0.34
bucket[99] contains: 10058 numbers, compressed size: 13701 compression factor: 0.34
Data size: 3.85M compressed 1.31M compression factor 0.34

示例代码

  public static void encode(int[] buffer, int length, BinaryOut output) {
    short size = (short)(length & 0x7FFF);

    output.write(size);
    output.write(buffer[0]);

    for(int i=1; i< size; i++) {
        int next = buffer[i] - buffer[i-1];
        int bits = getBinarySize(next);

        int len = bits;

        if(bits > 24) {
          output.write(3, 2);
          len = bits - 24;
        }else if(bits > 16) {
          output.write(2, 2);
          len = bits-16;
        }else if(bits > 8) {
          output.write(1, 2);
          len = bits - 8;
        }else{
          output.write(0, 2);
        }

        if (len > 0) {
            if ((len % 2) > 0) {
                len = len / 2;
                output.write(len, 2);
                output.write(false);
            } else {
                len = len / 2 - 1;
                output.write(len, 2);
            }

            output.write(next, bits);
        }
    }
}

public static short decode(BinaryIn input, int[] buffer, int offset) {
    short length = input.readShort();
    int value = input.readInt();
    buffer[offset] = value;

    for (int i = 1; i < length; i++) {
        int flag = input.readInt(2);

        int bits;
        int next = 0;
        switch (flag) {
            case 0:
                bits = 2 * input.readInt(2) + 2;
                next = input.readInt(bits);
                break;
            case 1:
                bits = 8 + 2 * input.readInt(2) +2;
                next = input.readInt(bits);
                break;
            case 2:
                bits = 16 + 2 * input.readInt(2) +2;
                next = input.readInt(bits);
                break;
            case 3:
                bits = 24 + 2 * input.readInt(2) +2;
                next = input.readInt(bits);
                break;
        }

        buffer[offset + i] = buffer[offset + i - 1] + next;
    }

   return length;
}

请注意，这种方法:

不消耗大量内存使用流提供了不那么坏的结果

完整的代码可以在这里找到，BinaryInput和BinaryOutput实现可以在这里找到

最终结论

没有最终结论:)有时候，从元级别的角度来回顾一下任务，这确实是个好主意。

花点时间完成这个任务很有趣。顺便说一下，下面有很多有趣的答案。感谢您的关注和愉快的编码。

2012-10-19 20:43:59

用1mb RAM对100万个8位数进行排序

推荐文章

最新文章

标签