如果使用第三方包是可以的,那么你可以使用iteration_utilities.unique_everseen:
>>> from iteration_utilities import unique_everseen
>>> l = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> list(unique_everseen(l))
[{'a': 123}, {'b': 123}]
它保留了原始列表的顺序,并且ut还可以通过采用较慢的算法(O(n*m),其中n是原始列表中的元素,m是原始列表中唯一的元素,而不是O(n))来处理字典等不可哈希项。如果键和值都是可哈希的,你可以使用该函数的key参数来为“唯一性测试”创建可哈希的项(这样它就可以在O(n)中工作)。
在字典的情况下(它的比较独立于顺序),你需要将它映射到另一个数据结构,这样比较,例如frozenset:
>>> list(unique_everseen(l, key=lambda item: frozenset(item.items())))
[{'a': 123}, {'b': 123}]
注意,你不应该使用简单的元组方法(没有排序),因为相等的字典不一定有相同的顺序(即使在Python 3.7中,插入顺序-而不是绝对顺序-是有保证的):
>>> d1 = {1: 1, 9: 9}
>>> d2 = {9: 9, 1: 1}
>>> d1 == d2
True
>>> tuple(d1.items()) == tuple(d2.items())
False
如果键不可排序,即使对元组进行排序也可能不起作用:
>>> d3 = {1: 1, 'a': 'a'}
>>> tuple(sorted(d3.items()))
TypeError: '<' not supported between instances of 'str' and 'int'
基准
我认为比较一下这些方法的性能可能会有用,所以我做了一个小的基准测试。基准图是基于不包含重复项的列表的时间与列表大小(该列表是任意选择的,如果添加一些或大量重复项,运行时不会发生显著变化)。这是一个对数对数图,所以涵盖了整个范围。
绝对时间:
与最快方法相关的时间:
The second approach from thefourtheye is fastest here. The unique_everseen approach with the key function is on the second place, however it's the fastest approach that preserves order. The other approaches from jcollado and thefourtheye are almost as fast. The approach using unique_everseen without key and the solutions from Emmanuel and Scorpil are very slow for longer lists and behave much worse O(n*n) instead of O(n). stpks approach with json isn't O(n*n) but it's much slower than the similar O(n) approaches.
重现基准测试的代码:
from simple_benchmark import benchmark
import json
from collections import OrderedDict
from iteration_utilities import unique_everseen
def jcollado_1(l):
return [dict(t) for t in {tuple(d.items()) for d in l}]
def jcollado_2(l):
seen = set()
new_l = []
for d in l:
t = tuple(d.items())
if t not in seen:
seen.add(t)
new_l.append(d)
return new_l
def Emmanuel(d):
return [i for n, i in enumerate(d) if i not in d[n + 1:]]
def Scorpil(a):
b = []
for i in range(0, len(a)):
if a[i] not in a[i+1:]:
b.append(a[i])
def stpk(X):
set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
return [json.loads(t) for t in set_of_jsons]
def thefourtheye_1(data):
return OrderedDict((frozenset(item.items()),item) for item in data).values()
def thefourtheye_2(data):
return {frozenset(item.items()):item for item in data}.values()
def iu_1(l):
return list(unique_everseen(l))
def iu_2(l):
return list(unique_everseen(l, key=lambda inner_dict: frozenset(inner_dict.items())))
funcs = (jcollado_1, Emmanuel, stpk, Scorpil, thefourtheye_1, thefourtheye_2, iu_1, jcollado_2, iu_2)
arguments = {2**i: [{'a': j} for j in range(2**i)] for i in range(2, 12)}
b = benchmark(funcs, arguments, 'list size')
%matplotlib widget
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = '8, 6'
b.plot(relative_to=thefourtheye_2)
为了完整起见,这里是只包含重复项的列表的计时:
# this is the only change for the benchmark
arguments = {2**i: [{'a': 1} for j in range(2**i)] for i in range(2, 12)}
除了没有键函数的unique_everseen之外,计时没有显著变化,在本例中,unique_everseen是最快的解决方案。然而,对于具有不可哈希值的函数来说,这只是最佳情况(所以不具有代表性),因为它的运行时间取决于列表中唯一值的数量:O(n*m),在这种情况下仅为1,因此它运行在O(n)。
免责声明:我是iteration_utilities的作者。