如何在整数列表中找到重复项并创建重复项的另一个列表?


当前回答

试试这个检查副本

>>> def checkDuplicate(List):
    duplicate={}
    for i in List:
            ## checking whether the item is already present in dictionary or not
            ## increasing count if present
            ## initializing count to 1 if not present

        duplicate[i]=duplicate.get(i,0)+1

    return [k for k,v in duplicate.items() if v>1]

>>> checkDuplicate([1,2,3,"s",1,2,3])
[1, 2, 3]

其他回答

尽管它的复杂度是O(n log n),但这似乎有点竞争性,请参阅下面的基准测试。

a = sorted(a)
dupes = list(set(a[::2]) & set(a[1::2]))

排序会把副本放在一起,所以它们都在偶数下标和奇数下标处。唯一值只能在偶数或奇数下标处存在,不能同时存在。所以偶数下标值和奇数下标值的交集就是重复项。

基准测试结果:

这使用了MSeifert的基准测试,但只使用了从接受的答案(georgs)、最慢的解决方案、最快的解决方案(不包括it_duplcopies,因为它不唯一重复)和我的解决方案。否则就太拥挤了,颜色也太相似了。

如果允许修改给定的列表,那么第一行可以是a.sort(),这样会快一些。但是基准会多次重用相同的列表,因此修改它会打乱基准。

显然set(a[::2]).intersection(a[1::2])不会创建第二个集合,而且速度会快一点,但它也会长一点。

你可以使用iteration_utilities.duplicate:

>>> from iteration_utilities import duplicates

>>> list(duplicates([1,1,2,1,2,3,4,2]))
[1, 1, 2, 2]

或者如果你只想要一个副本,可以结合iteration_utilities.unique_everseen:

>>> from iteration_utilities import unique_everseen

>>> list(unique_everseen(duplicates([1,1,2,1,2,3,4,2])))
[1, 2]

它也可以处理不可哈希的元素(但是以性能为代价):

>>> list(duplicates([[1], [2], [1], [3], [1]]))
[[1], [1]]

>>> list(unique_everseen(duplicates([[1], [2], [1], [3], [1]])))
[[1]]

这是只有少数其他方法可以处理的问题。

基准

我做了一个快速的基准测试,其中包含了这里提到的大部分(但不是全部)方法。

第一个基准测试只包含了很小范围的列表长度,因为一些方法具有O(n**2)行为。

在图表中,y轴代表时间,所以值越低越好。它还绘制了log-log,以便更好地可视化广泛的值范围:

除去O(n**2)方法,我在一个列表中做了另一个多达50万个元素的基准测试:

正如您所看到的iteration_utilities。duplicate方法比任何其他方法都快,甚至连链接unique_everseen(duplicate(…))也比其他方法更快或同样快。

这里需要注意的另一件有趣的事情是,熊猫方法对于小列表非常慢,但可以轻松地竞争较长的列表。

然而,由于这些基准测试显示大多数方法的性能大致相同,因此使用哪一种并不重要(除了有O(n**2)运行时的3种方法)。

from iteration_utilities import duplicates, unique_everseen
from collections import Counter
import pandas as pd
import itertools

def georg_counter(it):
    return [item for item, count in Counter(it).items() if count > 1]

def georg_set(it):
    seen = set()
    uniq = []
    for x in it:
        if x not in seen:
            uniq.append(x)
            seen.add(x)

def georg_set2(it):
    seen = set()
    return [x for x in it if x not in seen and not seen.add(x)]   

def georg_set3(it):
    seen = {}
    dupes = []

    for x in it:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1

def RiteshKumar_count(l):
    return set([x for x in l if l.count(x) > 1])

def moooeeeep(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set( x for x in seq if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list( seen_twice )

def F1Rumors_implementation(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in zip(a, b):
        if k != g: continue
        if k != r:
            yield k
            r = k

def F1Rumors(c):
    return list(F1Rumors_implementation(c))

def Edward(a):
    d = {}
    for elem in a:
        if elem in d:
            d[elem] += 1
        else:
            d[elem] = 1
    return [x for x, y in d.items() if y > 1]

def wordsmith(a):
    return pd.Series(a)[pd.Series(a).duplicated()].values

def NikhilPrabhu(li):
    li = li.copy()
    for x in set(li):
        li.remove(x)

    return list(set(li))

def firelynx(a):
    vc = pd.Series(a).value_counts()
    return vc[vc > 1].index.tolist()

def HenryDev(myList):
    newList = set()

    for i in myList:
        if myList.count(i) >= 2:
            newList.add(i)

    return list(newList)

def yota(number_lst):
    seen_set = set()
    duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
    return seen_set - duplicate_set

def IgorVishnevskiy(l):
    s=set(l)
    d=[]
    for x in l:
        if x in s:
            s.remove(x)
        else:
            d.append(x)
    return d

def it_duplicates(l):
    return list(duplicates(l))

def it_unique_duplicates(l):
    return list(unique_everseen(duplicates(l)))

基准1

from simple_benchmark import benchmark
import random

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, RiteshKumar_count, moooeeeep, 
    F1Rumors, Edward, wordsmith, NikhilPrabhu, firelynx,
    HenryDev, yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 12)}

b = benchmark(funcs, args, 'list size')

b.plot()

基准2

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, moooeeeep, 
    F1Rumors, Edward, wordsmith, firelynx,
    yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 20)}

b = benchmark(funcs, args, 'list size')
b.plot()

免责声明

1这是我写的一个第三方库:iteration_utilities。

list2 = [1, 2, 3, 4, 1, 2, 3]
lset = set()
[(lset.add(item), list2.append(item))
 for item in list2 if item not in lset]
print list(lset)

还有其他测试。当然要做……

set([x for x in l if l.count(x) > 1])

...代价太大了。使用下一个final方法大约快500倍(数组越长结果越好):

def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()

只有2个循环,没有非常昂贵的l.count()操作。

下面是一个比较方法的代码。代码如下,输出如下:

dups_count: 13.368s # this is a function which uses l.count()
dups_count_dict: 0.014s # this is a final best function (of the 3 functions)
dups_count_counter: 0.024s # collections.Counter

测试代码:

import numpy as np
from time import time
from collections import Counter

class TimerCounter(object):
    def __init__(self):
        self._time_sum = 0

    def start(self):
        self.time = time()

    def stop(self):
        self._time_sum += time() - self.time

    def get_time_sum(self):
        return self._time_sum


def dups_count(l):
    return set([x for x in l if l.count(x) > 1])


def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()


def dups_counter(l):
    counter = Counter(l)    

    result_d = {key: val for key, val in counter.iteritems() if val > 1}

    return result_d.keys()



def gen_array():
    np.random.seed(17)
    return list(np.random.randint(0, 5000, 10000))


def assert_equal_results(*results):
    primary_result = results[0]
    other_results = results[1:]

    for other_result in other_results:
        assert set(primary_result) == set(other_result) and len(primary_result) == len(other_result)


if __name__ == '__main__':
    dups_count_time = TimerCounter()
    dups_count_dict_time = TimerCounter()
    dups_count_counter = TimerCounter()

    l = gen_array()

    for i in range(3):
        dups_count_time.start()
        result1 = dups_count(l)
        dups_count_time.stop()

        dups_count_dict_time.start()
        result2 = dups_count_dict(l)
        dups_count_dict_time.stop()

        dups_count_counter.start()
        result3 = dups_counter(l)
        dups_count_counter.stop()

        assert_equal_results(result1, result2, result3)

    print 'dups_count: %.3f' % dups_count_time.get_time_sum()
    print 'dups_count_dict: %.3f' % dups_count_dict_time.get_time_sum()
    print 'dups_count_counter: %.3f' % dups_count_counter.get_time_sum()

一句话解决方案:

set([i for i in list if sum([1 for a in list if a == i]) > 1])