如何在整数列表中找到重复项并创建重复项的另一个列表?


当前回答

我注意到大多数解决方案的复杂度为O(n * n),对于大型列表来说非常缓慢。所以我想分享一下我写的函数,它支持整数或字符串,在最好的情况下是O(n)。对于一个包含10万个元素的列表,最上面的解决方案需要超过30秒,而我的解决方案只需0.12秒

def get_duplicates(list1):
    '''Return all duplicates given a list. O(n) complexity for best case scenario.
    input: [1, 1, 1, 2, 3, 4, 4]
    output: [1, 1, 4]
    '''
    dic = {}
    for el in list1:
        try:
            dic[el] += 1
        except:
            dic[el] = 1
    dupes = []
    for key in dic.keys():
        for i in range(dic[key] - 1):
            dupes.append(key)
    return dupes


list1 = [1, 1, 1, 2, 3, 4, 4]
> print(get_duplicates(list1))
[1, 1, 4]

或者获得唯一的副本:

> print(list(set(get_duplicates(list1))))
[1, 4]

其他回答

如果你不关心自己编写算法或使用库,Python 3.8一行代码:

l = [1,2,3,2,1,5,6,5,5,5]

res = [(x, count) for x, g in groupby(sorted(l)) if (count := len(list(g))) > 1]

print(res)

打印项目和计数:

[(1, 2), (2, 2), (5, 4)]

groupby接受一个分组函数,因此您可以以不同的方式定义分组,并根据需要返回额外的Tuple字段。

我会用熊猫做这个,因为我经常用熊猫

import pandas as pd
a = [1,2,3,3,3,4,5,6,6,7]
vc = pd.Series(a).value_counts()
vc[vc > 1].index.tolist()

给了

[3,6]

可能不是很有效,但它肯定比许多其他答案的代码更少,所以我想我可以贡献一下

有点晚了,但可能对一些人有帮助。 对于一个比较大的列表,我发现这个方法很适合我。

l=[1,2,3,5,4,1,3,1]
s=set(l)
d=[]
for x in l:
    if x in s:
        s.remove(x)
    else:
        d.append(x)
d
[1,3,1]

显示正确和所有重复,并保持秩序。

还有其他测试。当然要做……

set([x for x in l if l.count(x) > 1])

...代价太大了。使用下一个final方法大约快500倍(数组越长结果越好):

def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()

只有2个循环,没有非常昂贵的l.count()操作。

下面是一个比较方法的代码。代码如下,输出如下:

dups_count: 13.368s # this is a function which uses l.count()
dups_count_dict: 0.014s # this is a final best function (of the 3 functions)
dups_count_counter: 0.024s # collections.Counter

测试代码:

import numpy as np
from time import time
from collections import Counter

class TimerCounter(object):
    def __init__(self):
        self._time_sum = 0

    def start(self):
        self.time = time()

    def stop(self):
        self._time_sum += time() - self.time

    def get_time_sum(self):
        return self._time_sum


def dups_count(l):
    return set([x for x in l if l.count(x) > 1])


def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()


def dups_counter(l):
    counter = Counter(l)    

    result_d = {key: val for key, val in counter.iteritems() if val > 1}

    return result_d.keys()



def gen_array():
    np.random.seed(17)
    return list(np.random.randint(0, 5000, 10000))


def assert_equal_results(*results):
    primary_result = results[0]
    other_results = results[1:]

    for other_result in other_results:
        assert set(primary_result) == set(other_result) and len(primary_result) == len(other_result)


if __name__ == '__main__':
    dups_count_time = TimerCounter()
    dups_count_dict_time = TimerCounter()
    dups_count_counter = TimerCounter()

    l = gen_array()

    for i in range(3):
        dups_count_time.start()
        result1 = dups_count(l)
        dups_count_time.stop()

        dups_count_dict_time.start()
        result2 = dups_count_dict(l)
        dups_count_dict_time.stop()

        dups_count_counter.start()
        result3 = dups_counter(l)
        dups_count_counter.stop()

        assert_equal_results(result1, result2, result3)

    print 'dups_count: %.3f' % dups_count_time.get_time_sum()
    print 'dups_count_dict: %.3f' % dups_count_dict_time.get_time_sum()
    print 'dups_count_counter: %.3f' % dups_count_counter.get_time_sum()

不需要转换为列表,可能最简单的方法是如下所示。 在面试中,当他们要求不要使用集合时,这可能会很有用

a=[1,2,3,3,3]
dup=[]
for each in a:
  if each not in dup:
    dup.append(each)
print(dup)

======= else获取唯一值和重复值的2个单独列表

a=[1,2,3,3,3]
uniques=[]
dups=[]

for each in a:
  if each not in uniques:
    uniques.append(each)
  else:
    dups.append(each)
print("Unique values are below:")
print(uniques)
print("Duplicate values are below:")
print(dups)