如何在整数列表中找到重复项并创建重复项的另一个列表?


当前回答

假设我们有这个元素列表:

a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]

我们可以只使用集合来找到唯一的元素:

unique = set()
for num in a:
    if num not in unique:
        unique.add(num)
    else:
        unique = unique - set([num])

最后:

>>> unique
{3, 6}

如果你想要得到副本,你可以简单地做:

>>> duplicates = set(a) - unique
>>> duplicates
{1, 2, 5}

注:

集合中的元素查找是O(1) 从集合中移除的元素是O(1)

其他回答

还有其他测试。当然要做……

set([x for x in l if l.count(x) > 1])

...代价太大了。使用下一个final方法大约快500倍(数组越长结果越好):

def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()

只有2个循环,没有非常昂贵的l.count()操作。

下面是一个比较方法的代码。代码如下,输出如下:

dups_count: 13.368s # this is a function which uses l.count()
dups_count_dict: 0.014s # this is a final best function (of the 3 functions)
dups_count_counter: 0.024s # collections.Counter

测试代码:

import numpy as np
from time import time
from collections import Counter

class TimerCounter(object):
    def __init__(self):
        self._time_sum = 0

    def start(self):
        self.time = time()

    def stop(self):
        self._time_sum += time() - self.time

    def get_time_sum(self):
        return self._time_sum


def dups_count(l):
    return set([x for x in l if l.count(x) > 1])


def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()


def dups_counter(l):
    counter = Counter(l)    

    result_d = {key: val for key, val in counter.iteritems() if val > 1}

    return result_d.keys()



def gen_array():
    np.random.seed(17)
    return list(np.random.randint(0, 5000, 10000))


def assert_equal_results(*results):
    primary_result = results[0]
    other_results = results[1:]

    for other_result in other_results:
        assert set(primary_result) == set(other_result) and len(primary_result) == len(other_result)


if __name__ == '__main__':
    dups_count_time = TimerCounter()
    dups_count_dict_time = TimerCounter()
    dups_count_counter = TimerCounter()

    l = gen_array()

    for i in range(3):
        dups_count_time.start()
        result1 = dups_count(l)
        dups_count_time.stop()

        dups_count_dict_time.start()
        result2 = dups_count_dict(l)
        dups_count_dict_time.stop()

        dups_count_counter.start()
        result3 = dups_counter(l)
        dups_count_counter.stop()

        assert_equal_results(result1, result2, result3)

    print 'dups_count: %.3f' % dups_count_time.get_time_sum()
    print 'dups_count_dict: %.3f' % dups_count_dict_time.get_time_sum()
    print 'dups_count_counter: %.3f' % dups_count_counter.get_time_sum()

第三个接受答案的例子给出了一个错误的答案,并且没有试图给出重复的答案。下面是正确的版本:

number_lst = [1, 1, 2, 3, 5, ...]

seen_set = set()
duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
unique_set = seen_set - duplicate_set

集合。Counter是python 2.7中的新功能:


Python 2.5.4 (r254:67916, May 31 2010, 15:03:39) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
a = [1,2,3,2,1,5,6,5,5,5]
import collections
print [x for x, y in collections.Counter(a).items() if y > 1]
Type "help", "copyright", "credits" or "license" for more information.
  File "", line 1, in 
AttributeError: 'module' object has no attribute 'Counter'
>>> 

在早期版本中,你可以使用传统的字典:

a = [1,2,3,2,1,5,6,5,5,5]
d = {}
for elem in a:
    if elem in d:
        d[elem] += 1
    else:
        d[elem] = 1

print [x for x, y in d.items() if y > 1]

另一种解决方案如下所示,不使用任何集合库。

a = [1,2,3,5,4,6,4,21,4,6,3,32,5,2,23,5]
duplicates = []

for i in a:
    if a.count(i) > 1 and i not in duplicates:
        duplicates.append(i)

print(duplicates)

输出是[2,3,5,4,6]

下面是一个快速生成器,它使用dict将每个元素存储为一个带有布尔值的键,用于检查是否已经产生了重复项。

对于所有元素都是可哈希类型的列表:

def gen_dupes(array):
    unique = {}
    for value in array:
        if value in unique and unique[value]:
            unique[value] = False
            yield value
        else:
            unique[value] = True

array = [1, 2, 2, 3, 4, 1, 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, 1, 6]

对于可能包含列表的列表:

def gen_dupes(array):
    unique = {}
    for value in array:
        is_list = False
        if type(value) is list:
            value = tuple(value)
            is_list = True

        if value in unique and unique[value]:
            unique[value] = False
            if is_list:
                value = list(value)

            yield value
        else:
            unique[value] = True

array = [1, 2, 2, [1, 2], 3, 4, [1, 2], 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, [1, 2], 6]