如何在整数列表中找到重复项并创建重复项的另一个列表?
当前回答
在没有任何python数据结构的帮助下,你可以简单地尝试下面的代码。这将工作于寻找重复的各种输入,如字符串,列表等。
# finding duplicates in unsorted an array
def duplicates(numbers):
store=[]
checked=[]
for i in range(len(numbers)):
counter =1
for j in range(i+1,len(numbers)):
if numbers[i] not in checked and numbers[j]==numbers[i] :
counter +=1
if counter > 1 :
store.append(numbers[i])
checked.append(numbers[i])
return store
print(duplicates([1,2,2,3,3,3,4,4,5])) # output: [2, 3, 4]
print(duplicates("madam")) # output: ['m', 'a']
其他回答
some_list = ['a', 'b', 'c', 'b', 'd', 'm', 'n', 'n']
some_dictionary = {}
for element in some_list:
if element not in some_dictionary:
some_dictionary[element] = 1
else:
some_dictionary[element] += 1
for key, value in some_dictionary.items():
if value > 1:
print(key, end = ' ')
# another way
duplicates = []
for x in some_list:
if some_list.count(x) > 1 and x not in duplicates:
duplicates.append(x)
print()
print(duplicates)
来源:这里
下面是一个快速生成器,它使用dict将每个元素存储为一个带有布尔值的键,用于检查是否已经产生了重复项。
对于所有元素都是可哈希类型的列表:
def gen_dupes(array):
unique = {}
for value in array:
if value in unique and unique[value]:
unique[value] = False
yield value
else:
unique[value] = True
array = [1, 2, 2, 3, 4, 1, 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, 1, 6]
对于可能包含列表的列表:
def gen_dupes(array):
unique = {}
for value in array:
is_list = False
if type(value) is list:
value = tuple(value)
is_list = True
if value in unique and unique[value]:
unique[value] = False
if is_list:
value = list(value)
yield value
else:
unique[value] = True
array = [1, 2, 2, [1, 2], 3, 4, [1, 2], 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, [1, 2], 6]
你不需要计数,只需要该物品之前是否被看到过。把这个答案用在这个问题上:
def list_duplicates(seq):
seen = set()
seen_add = seen.add
# adds all elements it doesn't know yet to seen and all other to seen_twice
seen_twice = set( x for x in seq if x in seen or seen_add(x) )
# turn the set into a list (as requested)
return list( seen_twice )
a = [1,2,3,2,1,5,6,5,5,5]
list_duplicates(a) # yields [1, 2, 5]
以防速度很重要,这里有一些时间安排:
# file: test.py
import collections
def thg435(l):
return [x for x, y in collections.Counter(l).items() if y > 1]
def moooeeeep(l):
seen = set()
seen_add = seen.add
# adds all elements it doesn't know yet to seen and all other to seen_twice
seen_twice = set( x for x in l if x in seen or seen_add(x) )
# turn the set into a list (as requested)
return list( seen_twice )
def RiteshKumar(l):
return list(set([x for x in l if l.count(x) > 1]))
def JohnLaRooy(L):
seen = set()
seen2 = set()
seen_add = seen.add
seen2_add = seen2.add
for item in L:
if item in seen:
seen2_add(item)
else:
seen_add(item)
return list(seen2)
l = [1,2,3,2,1,5,6,5,5,5]*100
以下是结果:(做得好@JohnLaRooy!)
$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
10000 loops, best of 3: 74.6 usec per loop
$ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 91.3 usec per loop
$ python -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 266 usec per loop
$ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'
100 loops, best of 3: 8.35 msec per loop
有趣的是,除了计时本身,当使用pypy时,排名也略有变化。最有趣的是,基于counter的方法极大地受益于pypy的优化,而我建议的方法缓存方法似乎几乎没有任何效果。
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
100000 loops, best of 3: 17.8 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
10000 loops, best of 3: 23 usec per loop
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 39.3 usec per loop
显然,这种效应与输入数据的“重复性”有关。我设置了l = [random.randrange(1000000) for I in xrange(10000)],得到了这些结果:
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
1000 loops, best of 3: 495 usec per loop
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
1000 loops, best of 3: 499 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 1.68 msec per loop
要删除重复项,请使用集合(a)。要打印副本,可以这样做:
a = [1,2,3,2,1,5,6,5,5,5]
import collections
print([item for item, count in collections.Counter(a).items() if count > 1])
## [1, 2, 5]
请注意Counter并不是特别有效(计时),可能会在这里过度使用。Set会表现得更好。这段代码以源顺序计算一个唯一元素的列表:
seen = set()
uniq = []
for x in a:
if x not in seen:
uniq.append(x)
seen.add(x)
或者,更简洁地说:
seen = set()
uniq = [x for x in a if x not in seen and not seen.add(x)]
我不推荐后一种风格,因为它不清楚not seen.add(x)在做什么(set add()方法总是返回None,因此需要not)。
计算没有库的重复元素列表:
seen = set()
dupes = []
for x in a:
if x in seen:
dupes.append(x)
else:
seen.add(x)
或者,更简洁地说:
seen = set()
dupes = [x for x in a if x in seen or seen.add(x)]
如果列表元素不可哈希,则不能使用set /dicts,必须使用二次时间解决方案(逐个比较)。例如:
a = [[1], [2], [3], [1], [5], [3]]
no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
print no_dupes # [[1], [2], [3], [5]]
dupes = [x for n, x in enumerate(a) if x in a[:n]]
print dupes # [[1], [3]]
我注意到大多数解决方案的复杂度为O(n * n),对于大型列表来说非常缓慢。所以我想分享一下我写的函数,它支持整数或字符串,在最好的情况下是O(n)。对于一个包含10万个元素的列表,最上面的解决方案需要超过30秒,而我的解决方案只需0.12秒
def get_duplicates(list1):
'''Return all duplicates given a list. O(n) complexity for best case scenario.
input: [1, 1, 1, 2, 3, 4, 4]
output: [1, 1, 4]
'''
dic = {}
for el in list1:
try:
dic[el] += 1
except:
dic[el] = 1
dupes = []
for key in dic.keys():
for i in range(dic[key] - 1):
dupes.append(key)
return dupes
list1 = [1, 1, 1, 2, 3, 4, 4]
> print(get_duplicates(list1))
[1, 1, 4]
或者获得唯一的副本:
> print(list(set(get_duplicates(list1))))
[1, 4]
推荐文章
- 证书验证失败:无法获得本地颁发者证书
- 当使用pip3安装包时,“Python中的ssl模块不可用”
- 无法切换Python与pyenv
- Python if not == vs if !=
- 如何从scikit-learn决策树中提取决策规则?
- 为什么在Mac OS X v10.9 (Mavericks)的终端中apt-get功能不起作用?
- 将旋转的xtick标签与各自的xtick对齐
- 为什么元组可以包含可变项?
- 如何合并字典的字典?
- 如何创建类属性?
- 不区分大小写的“in”
- 在Python中获取迭代器中的元素个数
- 解析日期字符串并更改格式
- 使用try和。Python中的if
- 如何在Python中获得所有直接子目录