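Remove duplicate dicts in a list in Python

I have a list of dicts, and I'd like to remove the dicts that have identical key and value pairs.

For this list: [{'a': 123}, {'b': 123}, {'a': 123}]

I'd like to return this: [{'a': 123}, {'b': 123}]

Another example:

For this list: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

I'd like to return this: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]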

Current answer

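If you are using Pandas in your workflow, one option is to feed the list of dictionaries directly to the pd.DataFrame constructor, then use the drop_duplicates and to_dict methods to get the required result.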

import pandas as pd

d = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

d_unique = pd.DataFrame(d).drop_duplicates().to_dict('records')

print(d_unique)

[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

2018-08-01 13:34:58

Other answers

Not a universal answer, but if your list happens to be sorted by some key, like this:

l = [{'a': {'b': 31}, 't': 1},
     {'a': {'b': 31}, 't': 1},
     {'a': {'b': 145}, 't': 2},
     {'a': {'b': 25231}, 't': 2},
     {'a': {'b': 25231}, 't': 2},
     {'a': {'b': 25231}, 't': 2},
     {'a': {'b': 112}, 't': 3}]

then the solution is as simple as:

import itertools
result = [a[0] for a in itertools.groupby(l)]

Result:

[{'a': {'b': 31}, 't': 1},
{'a': {'b': 145}, 't': 2},
{'a': {'b': 25231}, 't': 2},
{'a': {'b': 112}, 't': 3}]

This works with nested dictionaries and (obviously) preserves order.
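To make the "sorted" precondition concrete, here is a minimal sketch showing that itertools.groupby only collapses runs of adjacent equal elements. The helper name dedupe_consecutive and the two sample lists are made up for illustration.

```python
import itertools

def dedupe_consecutive(dicts):
    """Collapse runs of equal adjacent dicts; non-adjacent duplicates survive."""
    # groupby compares neighbouring elements with ==, so no hashing is required
    # and nested (unhashable) values are fine.
    return [key for key, _group in itertools.groupby(dicts)]

# Duplicates sit next to each other: they are collapsed.
grouped = [{'a': {'b': 31}, 't': 1}, {'a': {'b': 31}, 't': 1}, {'a': {'b': 145}, 't': 2}]
# The same duplicate, but separated by another element: it is kept.
scattered = [{'a': {'b': 31}, 't': 1}, {'a': {'b': 145}, 't': 2}, {'a': {'b': 31}, 't': 1}]

print(len(dedupe_consecutive(grouped)))    # 2
print(len(dedupe_consecutive(scattered)))  # 3
```

If equal dicts are not guaranteed to be adjacent, one of the set-based or list-based approaches in the other answers is the safer choice.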

2018-06-14 07:49:36

Try this:

[dict(t) for t in {tuple(d.items()) for d in l}]

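The strategy is to convert the list of dictionaries to a list of tuples, where each tuple contains the items of one dictionary. Since tuples can be hashed, you can remove duplicates using set (a set comprehension is used here; the alternative for older Python versions would be set(tuple(d.items()) for d in l)) and then re-create the dictionaries from the tuples with dict.

where:

l is the original list
d is one of the dictionaries in the list
t is one of the tuples created from a dictionary

Edit: If you want to preserve ordering, the one-liner above won't work, since set doesn't keep order. However, with a few lines of code, you can also do that: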

l = [{'a': 123, 'b': 1234},
        {'a': 3222, 'b': 1234},
        {'a': 123, 'b': 1234}]

seen = set()
new_l = []
for d in l:
    t = tuple(d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)

print(new_l)

Example output:

[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

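Note: As pointed out by @alexis, two dictionaries with the same keys and values might not result in the same tuple. That can happen if they went through a different history of adding and removing keys. If that's the case for your problem, consider sorting d.items() as he suggests.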
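A minimal sketch of that sorting suggestion, keeping the order-preserving loop from above. It assumes all keys within a dict can be compared with each other and that all values are hashable; the function name dedupe_sorted_items is made up for this example.

```python
def dedupe_sorted_items(dicts):
    """Order-preserving dedupe that ignores the insertion order of keys."""
    seen = set()
    unique = []
    for d in dicts:
        # Sorting the (key, value) pairs gives equal dicts the same tuple,
        # regardless of the order in which their keys were inserted.
        key = tuple(sorted(d.items()))
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique

first = {'a': 123, 'b': 1234}
second = {'b': 1234, 'a': 123}   # equal to first, different insertion order
third = {'a': 3222, 'b': 1234}

print(dedupe_sorted_items([first, second, third]))
# [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]
```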

2012-02-24 07:51:31

If using a third-party package is okay, you could use iteration_utilities.unique_everseen:

>>> from iteration_utilities import unique_everseen
>>> l = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> list(unique_everseen(l))
[{'a': 123}, {'b': 123}]

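It preserves the order of the original list, and it can also handle unhashable items like dictionaries by falling back on a slower algorithm (O(n*m), where n is the number of elements in the original list and m is the number of unique elements, instead of O(n)). If both keys and values are hashable, you can use the key argument of that function to create hashable items for the "uniqueness test" (so that it works in O(n)).

In the case of a dictionary (which compares independent of order), you need to map it to another data structure that compares the same way, for example frozenset: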

>>> list(unique_everseen(l, key=lambda item: frozenset(item.items())))
[{'a': 123}, {'b': 123}]

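Note that you shouldn't use a simple tuple approach (without sorting), because equal dictionaries don't necessarily have the same order (even in Python 3.7, where insertion order, not absolute order, is guaranteed):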

>>> d1 = {1: 1, 9: 9}
>>> d2 = {9: 9, 1: 1}
>>> d1 == d2
True
>>> tuple(d1.items()) == tuple(d2.items())
False

And even sorting the tuple might not work if the keys aren't sortable:

>>> d3 = {1: 1, 'a': 'a'}
>>> tuple(sorted(d3.items()))
TypeError: '<' not supported between instances of 'str' and 'int'
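For readers who prefer to stay in the standard library, the same frozenset idea can be sketched with a plain loop. This is not how unique_everseen is implemented; it is only an illustration, and it assumes all values are hashable. The function name dedupe_frozenset is made up.

```python
def dedupe_frozenset(dicts):
    """Order-preserving dedupe keyed on frozenset(d.items())."""
    seen = set()
    unique = []
    for d in dicts:
        # A frozenset ignores item order and never needs to sort,
        # so mixed key types such as int and str are not a problem.
        key = frozenset(d.items())
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique

mixed = [{1: 1, 'a': 'a'}, {'a': 'a', 1: 1}, {9: 9}]
print(dedupe_frozenset(mixed))
# [{1: 1, 'a': 'a'}, {9: 9}]
```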

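Benchmark

I thought it might be useful to see how the performance of these approaches compares, so I did a small benchmark. The benchmark graphs show time vs. list size for a list containing no duplicates (that was chosen arbitrarily; the runtime doesn't change significantly if I add some or lots of duplicates). It's a log-log plot, so the complete range is covered.

The absolute times: (plot not reproduced here)

The timings relative to the fastest approach: (plot not reproduced here)

The second approach from thefourtheye is the fastest here. The unique_everseen approach with the key function comes second, but it is the fastest approach that preserves order. The other approaches from jcollado and thefourtheye are almost as fast. The approach using unique_everseen without key and the solutions from Emmanuel and Scorpil are very slow for longer lists and scale much worse: O(n*n) instead of O(n). stpk's approach with json isn't O(n*n), but it's much slower than the similar O(n) approaches.

The code to reproduce the benchmark: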

from simple_benchmark import benchmark
import json
from collections import OrderedDict
from iteration_utilities import unique_everseen

def jcollado_1(l):
    return [dict(t) for t in {tuple(d.items()) for d in l}]

def jcollado_2(l):
    seen = set()
    new_l = []
    for d in l:
        t = tuple(d.items())
        if t not in seen:
            seen.add(t)
            new_l.append(d)
    return new_l

def Emmanuel(d):
    return [i for n, i in enumerate(d) if i not in d[n + 1:]]

def Scorpil(a):
    b = []
    for i in range(0, len(a)):
        if a[i] not in a[i+1:]:
            b.append(a[i])
    return b

def stpk(X):
    set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
    return [json.loads(t) for t in set_of_jsons]

def thefourtheye_1(data):
    return OrderedDict((frozenset(item.items()),item) for item in data).values()

def thefourtheye_2(data):
    return {frozenset(item.items()):item for item in data}.values()

def iu_1(l):
    return list(unique_everseen(l))

def iu_2(l):
    return list(unique_everseen(l, key=lambda inner_dict: frozenset(inner_dict.items())))

funcs = (jcollado_1, Emmanuel, stpk, Scorpil, thefourtheye_1, thefourtheye_2, iu_1, jcollado_2, iu_2)
arguments = {2**i: [{'a': j} for j in range(2**i)] for i in range(2, 12)}
b = benchmark(funcs, arguments, 'list size')

# IPython/Jupyter magic (not valid in a plain Python script)
%matplotlib widget
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = '8, 6'

b.plot(relative_to=thefourtheye_2)

For completeness, here is the timing for a list containing only duplicates:

# this is the only change for the benchmark
arguments = {2**i: [{'a': 1} for j in range(2**i)] for i in range(2, 12)}

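The timings don't change significantly, except for unique_everseen without a key function, which in this case is the fastest solution. However, that's just the best case (so not representative) for that function with unhashable values, because its runtime depends on the number of unique values in the list: O(n*m), where m is just 1 in this case, so it runs in O(n).

Disclaimer: I'm the author of iteration_utilities.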
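The O(n*m) behaviour can be made visible without timing anything by counting equality comparisons in a simple linear-scan dedupe. This is an illustrative sketch only, not the actual implementation inside iteration_utilities; the function name dedupe_linear_scan is made up.

```python
def dedupe_linear_scan(items):
    """Dedupe by scanning the list of uniques; also count == comparisons."""
    unique = []
    comparisons = 0
    for item in items:
        for candidate in unique:
            comparisons += 1
            if item == candidate:
                break          # duplicate found, stop scanning
        else:
            unique.append(item)  # no match among the uniques seen so far
    return unique, comparisons

n = 100
all_unique = [{'a': j} for j in range(n)]   # m == n
all_same = [{'a': 1} for _ in range(n)]     # m == 1

print(dedupe_linear_scan(all_unique)[1])  # 4950, i.e. n*(n-1)/2
print(dedupe_linear_scan(all_same)[1])    # 99, i.e. n-1
```

With only one unique value, every item is compared once and the scan is linear; with all values unique, the number of comparisons grows quadratically.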

2018-07-17 19:43:56

If you don't care about scale and crazy performance, here is a simple function:

# Filters dicts with the same value in unique_key
# in: [{'k1': 1}, {'k1': 33}, {'k1': 1}]
# out: [{'k1': 1}, {'k1': 33}]
def remove_dup_dicts(list_of_dicts: list, unique_key) -> list:
    unique_values = list()
    unique_dicts = list()
    for obj in list_of_dicts:
        val = obj.get(unique_key)
        if val not in unique_values:
            unique_values.append(val)
            unique_dicts.append(obj)
    return unique_dicts
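Two properties of the function above are worth knowing: it compares a single key only, and because obj.get returns None for a missing key, all dicts lacking unique_key after the first are treated as duplicates of each other. The variant below is a sketch under two assumptions that are not part of the original answer: the values stored under unique_key are hashable, and dicts without the key should all be kept. The name dedupe_by_key is made up.

```python
def dedupe_by_key(list_of_dicts, unique_key):
    """Keep the first dict for each distinct value stored under unique_key."""
    seen = set()
    unique = []
    for obj in list_of_dicts:
        if unique_key not in obj:
            # Design choice for this sketch: dicts without the key are kept as-is.
            unique.append(obj)
            continue
        val = obj[unique_key]
        if val not in seen:   # set lookup instead of a list scan
            seen.add(val)
            unique.append(obj)
    return unique

records = [{'k1': 1}, {'k1': 33}, {'other': 0}, {'k1': 1}, {'other': 5}]
print(dedupe_by_key(records, 'k1'))
# [{'k1': 1}, {'k1': 33}, {'other': 0}, {'other': 5}]
```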

2022-03-13 14:29:10

Not so short, but easy to read:

list_of_data = [{'a': 123}, {'b': 123}, {'a': 123}]

list_of_data_uniq = []
for data in list_of_data:
    if data not in list_of_data_uniq:
        list_of_data_uniq.append(data)

Now the list list_of_data_uniq will contain only unique dicts.
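The loop above also works for nested dicts, but each membership test scans the whole result list. If the dicts are JSON-serializable (an assumption), a serialized string can serve as a hashable key instead, similar in spirit to the json-based function in the benchmark code but keeping the original order. The name dedupe_json_key is made up, and note that JSON turns non-string keys into strings, so 1 and '1' would collide as keys.

```python
import json

def dedupe_json_key(dicts):
    """Order-preserving dedupe using a canonical JSON string as the key."""
    seen = set()
    unique = []
    for d in dicts:
        # sort_keys=True makes the string independent of key insertion order,
        # including inside nested dicts.
        key = json.dumps(d, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique

nested = [{'a': {'b': 31}, 't': 1}, {'t': 1, 'a': {'b': 31}}, {'a': {'b': 145}, 't': 2}]
print(dedupe_json_key(nested))
# [{'a': {'b': 31}, 't': 1}, {'a': {'b': 145}, 't': 2}]
```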

2019-11-17 09:59:11
