Remove duplicate dicts in a list in Python

If using a third-party package is okay, you can use iteration_utilities.unique_everseen:

>>> from iteration_utilities import unique_everseen
>>> l = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> list(unique_everseen(l))
[{'a': 123}, {'b': 123}]

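It preserves the order of the original list, and it can also handle unhashable items such as dictionaries by falling back to a slower algorithm (O(n*m) instead of O(n), where n is the number of elements in the original list and m is the number of unique elements in it). If both the keys and the values are hashable, you can use the key argument of the function to create hashable items for the "uniqueness test" (so that it runs in O(n)).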
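To picture where the O(n*m) comes from, here is a minimal pure-Python sketch of an equality-based fallback. It is not the library's actual implementation, and the name unique_unhashable is made up for illustration. Each of the n items is compared against the list of up to m unique items found so far.

```python
def unique_unhashable(items):
    """Order-preserving de-duplication that relies only on ==."""
    seen = []  # a list, because unhashable items cannot go into a set
    for item in items:
        if item not in seen:  # linear scan over the unique items so far
            seen.append(item)
    return seen

result = unique_unhashable([{'a': 123}, {'b': 123}, {'a': 123}])
print(result)  # [{'a': 123}, {'b': 123}]
```

With a hashable key, the linear scan can be replaced by a set lookup, which is what brings the cost down to O(n).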

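In the case of a dictionary (which compares independently of order), you need to map it to another data structure that compares the same way, for example a frozenset: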

>>> list(unique_everseen(l, key=lambda item: frozenset(item.items())))
[{'a': 123}, {'b': 123}]

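Note that you shouldn't use a simple tuple approach (without sorting), because equal dictionaries don't necessarily have the same order (even in Python 3.7, where insertion order, not absolute order, is guaranteed):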

>>> d1 = {1: 1, 9: 9}
>>> d2 = {9: 9, 1: 1}
>>> d1 == d2
True
>>> tuple(d1.items()) == tuple(d2.items())
False

And even sorting the tuple might not work if the keys aren't sortable:

>>> d3 = {1: 1, 'a': 'a'}
>>> tuple(sorted(d3.items()))
TypeError: '<' not supported between instances of 'str' and 'int'
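For contrast, a small sketch showing that the frozenset key needs neither a consistent order nor sortable keys, only hashable keys and values. The dictionary d4 is an extra example introduced here and is not part of the original answer.

```python
d3 = {1: 1, 'a': 'a'}
d4 = {'a': 'a', 1: 1}  # same content, different insertion order

key3 = frozenset(d3.items())
key4 = frozenset(d4.items())

# Equal frozensets also have equal hashes, so both dicts map to the same set entry.
same_key = key3 == key4 and hash(key3) == hash(key4)
print(same_key)  # True
```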

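Benchmark

I thought it might be useful to see how the performance of these approaches compares, so I did a small benchmark. The benchmark graphs show time vs. list size for a list containing no duplicates (that was chosen arbitrarily; the runtime doesn't change significantly if some or many duplicates are added). It's a log-log plot, so the complete range is covered.

The absolute times (plot not included):

Timings relative to the fastest approach (plot not included):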

The second approach from thefourtheye is the fastest here. The unique_everseen approach with the key function comes in second place; however, it's the fastest approach that preserves order. The other approaches from jcollado and thefourtheye are almost as fast. The approach using unique_everseen without key and the solutions from Emmanuel and Scorpil are very slow for longer lists and scale much worse: O(n*n) instead of O(n). stpk's approach with json isn't O(n*n), but it's much slower than the similar O(n) approaches.

The code to reproduce the benchmark:

from simple_benchmark import benchmark
import json
from collections import OrderedDict
from iteration_utilities import unique_everseen

def jcollado_1(l):
    return [dict(t) for t in {tuple(d.items()) for d in l}]

def jcollado_2(l):
    seen = set()
    new_l = []
    for d in l:
        t = tuple(d.items())
        if t not in seen:
            seen.add(t)
            new_l.append(d)
    return new_l

def Emmanuel(d):
    return [i for n, i in enumerate(d) if i not in d[n + 1:]]

def Scorpil(a):
    b = []
    for i in range(0, len(a)):
        if a[i] not in a[i+1:]:
            b.append(a[i])
    return b

def stpk(X):
    set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
    return [json.loads(t) for t in set_of_jsons]

def thefourtheye_1(data):
    return OrderedDict((frozenset(item.items()),item) for item in data).values()

def thefourtheye_2(data):
    return {frozenset(item.items()):item for item in data}.values()

def iu_1(l):
    return list(unique_everseen(l))

def iu_2(l):
    return list(unique_everseen(l, key=lambda inner_dict: frozenset(inner_dict.items())))

funcs = (jcollado_1, Emmanuel, stpk, Scorpil, thefourtheye_1, thefourtheye_2, iu_1, jcollado_2, iu_2)
arguments = {2**i: [{'a': j} for j in range(2**i)] for i in range(2, 12)}
b = benchmark(funcs, arguments, 'list size')

%matplotlib widget
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = '8, 6'

b.plot(relative_to=thefourtheye_2)

For completeness, here is the timing for a list containing only duplicates:

# this is the only change for the benchmark
arguments = {2**i: [{'a': 1} for j in range(2**i)] for i in range(2, 12)}

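The timings don't change significantly, except for unique_everseen without the key function, which in this case is the fastest solution. However, that's just the best case (so not representative) for that function with unhashable values, because its runtime depends on the number of unique values in the list: O(n*m), where m is just 1 in this case, so it runs in O(n).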

Disclaimer: I'm the author of iteration_utilities.

2018-07-17 19:43:56

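Here is a quick one-line solution with a doubly nested list comprehension (based on @Emmanuel's solution).

It uses a single key (for example, a) in each dict as the primary key, rather than checking whether the entire dict matches.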

[i for n, i in enumerate(list_of_dicts) if i.get(primary_key) not in [y.get(primary_key) for y in list_of_dicts[n + 1:]]]

It's not what the OP asked for, but it's what brought me to this thread, so I figured I'd post the solution I ended up with.
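The one-liner rescans the rest of the list for every element, so it is quadratic in the list length. If the primary-key values are hashable, a set of already-seen values gives the same kind of filtering in a single pass. This is only a sketch: the name unique_by_key and the sample rows are made up, and unlike the one-liner (which keeps the last dict for each key value) it keeps the first one.

```python
def unique_by_key(list_of_dicts, primary_key):
    """Keep the first dict seen for each value of primary_key."""
    seen = set()
    result = []
    for d in list_of_dicts:
        value = d.get(primary_key)  # None if the key is missing
        if value not in seen:
            seen.add(value)
            result.append(d)
    return result

rows = [{'a': 1, 'x': 'first'}, {'a': 2, 'x': 'second'}, {'a': 1, 'x': 'third'}]
deduped = unique_by_key(rows, 'a')
print(deduped)  # [{'a': 1, 'x': 'first'}, {'a': 2, 'x': 'second'}]
```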

2020-02-14 06:37:18

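There are a lot of good examples of searching for duplicate values and keys; below is how we filter out whole-dictionary duplicates in lists. Use dupKeys = [] if your source data consists of EXACTLY identically formatted dictionaries and you are looking for duplicates. Otherwise, set dupKeys to the key names of the data that must not have duplicate entries; this can be 1 to n keys. It isn't elegant, but it works and is very flexible.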

import binascii

collected_sensor_data = [{"sensor_id":"nw-180","data":"XXXXXXX"},
                         {"sensor_id":"nw-163","data":"ZYZYZYY"},
                         {"sensor_id":"nw-180","data":"XXXXXXX"},
                         {"sensor_id":"nw-97", "data":"QQQQQZZ"}]

dupKeys = ["sensor_id", "data"]

def RemoveDuplicateDictData(collected_sensor_data, dupKeys):

    checkCRCs = []
    final_sensor_data = []
    
    if dupKeys == []:
        for sensor_read in collected_sensor_data:
            ck1 = binascii.crc32(str(sensor_read).encode('utf8'))
            if not ck1 in checkCRCs:
                final_sensor_data.append(sensor_read)
                checkCRCs.append(ck1)
    else:
        for sensor_read in collected_sensor_data:
            tmp = ""
            for k in dupKeys:
                tmp += str(sensor_read[k])

            ck1 = binascii.crc32(tmp.encode('utf8'))
            if not ck1 in checkCRCs:
                final_sensor_data.append(sensor_read)
                checkCRCs.append(ck1)
  
           
    return final_sensor_data

final_sensor_data = RemoveDuplicateDictData(collected_sensor_data, dupKeys)
# final_sensor_data is now:
# [{"sensor_id":"nw-180","data":"XXXXXXX"},
#  {"sensor_id":"nw-163","data":"ZYZYZYY"},
#  {"sensor_id":"nw-97", "data":"QQQQQZZ"}]
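One caveat with the checksum approach: concatenated strings can coincide for different records ("ab" + "c" and "a" + "bc" give the same text), and a 32-bit CRC can collide in principle. A possible variant, sketched here under the assumption that the selected values are hashable, keeps a tuple of the values in a set instead. The names remove_duplicates_by_keys and readings are illustrative only.

```python
def remove_duplicates_by_keys(records, keys):
    """Drop records whose values for `keys` were already seen (first one wins)."""
    seen = set()
    result = []
    for record in records:
        if keys:
            fingerprint = tuple(record[k] for k in keys)
        else:
            # No keys given: compare whole dicts, independent of key order.
            fingerprint = frozenset(record.items())
        if fingerprint not in seen:
            seen.add(fingerprint)
            result.append(record)
    return result

readings = [{"sensor_id": "nw-180", "data": "XXXXXXX"},
            {"sensor_id": "nw-163", "data": "ZYZYZYY"},
            {"sensor_id": "nw-180", "data": "XXXXXXX"},
            {"sensor_id": "nw-97", "data": "QQQQQZZ"}]
unique_readings = remove_duplicates_by_keys(readings, ["sensor_id", "data"])
print(len(unique_readings))  # 3
```

The set lookup also avoids the linear search through the checkCRCs list on every record.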

2021-02-17 16:42:39

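The easiest way is to convert each item in the list to a string, since dictionaries are not hashable. Then you can use set to remove the duplicates.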

list_org = [{'a': 123}, {'b': 123}, {'a': 123}]
list_org_updated = [ str(item) for item in list_org]
print(list_org_updated)
["{'a': 123}", "{'b': 123}", "{'a': 123}"]
unique_set = set(list_org_updated)
print(unique_set)
{"{'b': 123}", "{'a': 123}"}

You can use the set as it is, but if you want a list, add the following:

import ast
unique_list = [ast.literal_eval(item) for item in unique_set]
print(unique_list)
[{'b': 123}, {'a': 123}]
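One thing to watch with the string approach: str() follows insertion order, so two equal dictionaries built in a different key order produce different strings and would both survive. A small sketch of the issue, using json.dumps with sort_keys=True as one possible canonical form (it assumes the values are JSON-serializable and the keys are sortable strings); d1 and d2 are made-up examples.

```python
import json

d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}  # equal to d1, but inserted in a different order

print(d1 == d2)            # True
print(str(d1) == str(d2))  # False: the strings follow insertion order

# Sorting the keys while serializing yields the same text for both dicts.
canonical_equal = json.dumps(d1, sort_keys=True) == json.dumps(d2, sort_keys=True)
print(canonical_equal)  # True
```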

2021-02-21 02:21:22

If you want to preserve the order, then you can do

from collections import OrderedDict
print(list(OrderedDict((frozenset(item.items()), item) for item in data).values()))
# [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

If the order doesn't matter, then you can do

print(list({frozenset(item.items()): item for item in data}.values()))
# e.g. [{'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}] (the order is arbitrary before Python 3.7)

2014-04-29 07:52:59
