如何将任意长度的列表拆分为大小相等的块?


请参阅如果数据结果将直接用于循环,并且不需要存储,则如何以块形式遍历列表。

对于字符串输入的同一问题,请参见每n个字符拆分字符串?。相同的技术通常适用,但也有一些变化。


当前回答

一种老式的方法,不需要itertools,但仍然可以使用任意生成器:

def chunks(g, n):
  """divide a generator 'g' into small chunks
  Yields:
    a chunk that has 'n' or less items
  """
  n = max(1, n)
  buff = []
  for item in g:
    buff.append(item)
    if len(buff) == n:
      yield buff
      buff = []
  if buff:
    yield buff

其他回答

因为这里的每个人都在谈论迭代器。boltons有一个完美的方法,叫做iterutils.chunked_iter。

from boltons import iterutils

list(iterutils.chunked_iter(list(range(50)), 11))

输出:

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
 [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32],
 [33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43],
 [44, 45, 46, 47, 48, 49]]

但如果您不想在内存上手下留情,您可以使用旧方法,首先使用iterutils.chunked存储完整列表。

由于我必须这样做,下面是我的解决方案,给出了一个生成器和一个批量大小:

def pop_n_elems_from_generator(g, n):
    elems = []
    try:
        for idx in xrange(0, n):
            elems.append(g.next())
        return elems
    except StopIteration:
        return elems

我不喜欢按块大小拆分元素的想法,例如,脚本可以将101到3个块划分为[50,50,1]。为了我的需要,我需要按比例分配,保持秩序不变。首先我写了自己的剧本,效果很好,而且很简单。但我后来看到了这个答案,剧本比我的好,我想是这样的。这是我的脚本:

def proportional_dividing(N, n):
    """
    N - length of array (bigger number)
    n - number of chunks (smaller number)
    output - arr, containing N numbers, diveded roundly to n chunks
    """
    arr = []
    if N == 0:
        return arr
    elif n == 0:
        arr.append(N)
        return arr
    r = N // n
    for i in range(n-1):
        arr.append(r)
    arr.append(N-r*(n-1))

    last_n = arr[-1]
    # last number always will be r <= last_n < 2*r
    # when last_n == r it's ok, but when last_n > r ...
    if last_n > r:
        # ... and if difference too big (bigger than 1), then
        if abs(r-last_n) > 1:
            #[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 7] # N=29, n=12
            # we need to give unnecessary numbers to first elements back
            diff = last_n - r
            for k in range(diff):
                arr[k] += 1
            arr[-1] = r
            # and we receive [3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2]
    return arr

def split_items(items, chunks):
    arr = proportional_dividing(len(items), chunks)
    splitted = []
    for chunk_size in arr:
        splitted.append(items[:chunk_size])
        items = items[chunk_size:]
    print(splitted)
    return splitted

items = [1,2,3,4,5,6,7,8,9,10,11]
chunks = 3
split_items(items, chunks)
split_items(['a','b','c','d','e','f','g','h','i','g','k','l', 'm'], 3)
split_items(['a','b','c','d','e','f','g','h','i','g','k','l', 'm', 'n'], 3)
split_items(range(100), 4)
split_items(range(99), 4)
split_items(range(101), 4)

和输出:

[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11]]
[['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'h'], ['i', 'g', 'k', 'l', 'm']]
[['a', 'b', 'c', 'd', 'e'], ['f', 'g', 'h', 'i', 'g'], ['k', 'l', 'm', 'n']]
[range(0, 25), range(25, 50), range(50, 75), range(75, 100)]
[range(0, 25), range(25, 50), range(50, 75), range(75, 99)]
[range(0, 25), range(25, 50), range(50, 75), range(75, 101)]

任何可迭代的通用分块器,使用户可以选择如何在结尾处处理部分分块。

在Python 3上测试。

分块.py

from enum import Enum

class PartialChunkOptions(Enum):
    INCLUDE = 0
    EXCLUDE = 1
    PAD = 2
    ERROR = 3

class PartialChunkException(Exception):
    pass

def chunker(iterable, n, on_partial=PartialChunkOptions.INCLUDE, pad=None):
    """
    A chunker yielding n-element lists from an iterable, with various options
    about what to do about a partial chunk at the end.

    on_partial=PartialChunkOptions.INCLUDE (the default):
                     include the partial chunk as a short (<n) element list

    on_partial=PartialChunkOptions.EXCLUDE
                     do not include the partial chunk

    on_partial=PartialChunkOptions.PAD
                     pad to an n-element list 
                     (also pass pad=<pad_value>, default None)

    on_partial=PartialChunkOptions.ERROR
                     raise a RuntimeError if a partial chunk is encountered
    """

    on_partial = PartialChunkOptions(on_partial)        

    iterator = iter(iterable)
    while True:
        vals = []
        for i in range(n):
            try:
                vals.append(next(iterator))
            except StopIteration:
                if vals:
                    if on_partial == PartialChunkOptions.INCLUDE:
                        yield vals
                    elif on_partial == PartialChunkOptions.EXCLUDE:
                        pass
                    elif on_partial == PartialChunkOptions.PAD:
                        yield vals + [pad] * (n - len(vals))
                    elif on_partial == PartialChunkOptions.ERROR:
                        raise PartialChunkException
                    return
                return
        yield vals

测试.py

import chunker

chunk_size = 3

for it in (range(100, 107),
          range(100, 109)):

    print("\nITERABLE TO CHUNK: {}".format(it))
    print("CHUNK SIZE: {}".format(chunk_size))

    for option in chunker.PartialChunkOptions.__members__.values():
        print("\noption {} used".format(option))
        try:
            for chunk in chunker.chunker(it, chunk_size, on_partial=option):
                print(chunk)
        except chunker.PartialChunkException:
            print("PartialChunkException was raised")
    print("")

test.py的输出


ITERABLE TO CHUNK: range(100, 107)
CHUNK SIZE: 3

option PartialChunkOptions.INCLUDE used
[100, 101, 102]
[103, 104, 105]
[106]

option PartialChunkOptions.EXCLUDE used
[100, 101, 102]
[103, 104, 105]

option PartialChunkOptions.PAD used
[100, 101, 102]
[103, 104, 105]
[106, None, None]

option PartialChunkOptions.ERROR used
[100, 101, 102]
[103, 104, 105]
PartialChunkException was raised


ITERABLE TO CHUNK: range(100, 109)
CHUNK SIZE: 3

option PartialChunkOptions.INCLUDE used
[100, 101, 102]
[103, 104, 105]
[106, 107, 108]

option PartialChunkOptions.EXCLUDE used
[100, 101, 102]
[103, 104, 105]
[106, 107, 108]

option PartialChunkOptions.PAD used
[100, 101, 102]
[103, 104, 105]
[106, 107, 108]

option PartialChunkOptions.ERROR used
[100, 101, 102]
[103, 104, 105]
[106, 107, 108]

我很好奇不同方法的性能,这里是:

在Python 3.5.1上测试

import time
batch_size = 7
arr_len = 298937

#---------slice-------------

print("\r\nslice")
start = time.time()
arr = [i for i in range(0, arr_len)]
while True:
    if not arr:
        break

    tmp = arr[0:batch_size]
    arr = arr[batch_size:-1]
print(time.time() - start)

#-----------index-----------

print("\r\nindex")
arr = [i for i in range(0, arr_len)]
start = time.time()
for i in range(0, round(len(arr) / batch_size + 1)):
    tmp = arr[batch_size * i : batch_size * (i + 1)]
print(time.time() - start)

#----------batches 1------------

def batch(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

print("\r\nbatches 1")
arr = [i for i in range(0, arr_len)]
start = time.time()
for x in batch(arr, batch_size):
    tmp = x
print(time.time() - start)

#----------batches 2------------

from itertools import islice, chain

def batch(iterable, size):
    sourceiter = iter(iterable)
    while True:
        batchiter = islice(sourceiter, size)
        yield chain([next(batchiter)], batchiter)


print("\r\nbatches 2")
arr = [i for i in range(0, arr_len)]
start = time.time()
for x in batch(arr, batch_size):
    tmp = x
print(time.time() - start)

#---------chunks-------------
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
print("\r\nchunks")
arr = [i for i in range(0, arr_len)]
start = time.time()
for x in chunks(arr, batch_size):
    tmp = x
print(time.time() - start)

#-----------grouper-----------

from itertools import zip_longest # for Python 3.x
#from six.moves import zip_longest # for both (uses the six compat library)

def grouper(iterable, n, padvalue=None):
    "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

arr = [i for i in range(0, arr_len)]
print("\r\ngrouper")
start = time.time()
for x in grouper(arr, batch_size):
    tmp = x
print(time.time() - start)

结果:

slice
31.18285083770752

index
0.02184295654296875

batches 1
0.03503894805908203

batches 2
0.22681021690368652

chunks
0.019841909408569336

grouper
0.006506919860839844