假设我有两个表,l1和l2。我想执行l1 - l2,返回l1中不在l2中的所有元素。

我可以想出一个简单的循环方法来做这个,但那真的很低效。python式的高效方法是什么?

举个例子,如果l1 = [1,2,6,8], l2 = [2,3,5,8], l1 - l2应该返回[1,6]


当前回答

通过利用字典的有序属性来维持顺序(Python 3.7+)

注意:Python 3.6中字典的参考实现按照插入顺序维护键,但规范不保证这一点。对于3.7及更高版本,添加了这个保证。

字典的键作为一种集合;重复项被隐式过滤掉,由于散列,查找是高效的。因此,我们可以通过使用l1作为键来构建字典,然后删除出现在l2中的任何键来实现“set difference”。这维持了秩序并使用了一种快速的算法,但会产生相当数量的常量开销。

d = dict.fromkeys(l1)
for i in l2:
    try:
        del d[i]
    except KeyError:
        pass
l3 = list(d.keys())

其他回答

一种方法是使用集合:

>>> set([1,2,6,8]) - set([2,3,5,8])
set([1, 6])

但是请注意,集合不会保留元素的顺序,并且会删除任何重复的元素。元素也需要是可哈希的。如果这些限制是可以容忍的,那么这通常是最简单和性能最高的选项。

使用Set推导式{x for x in l2}或Set (l2)来获取Set,然后使用List推导式来获取List

l2set = set(l2)
l3 = [x for x in l1 if x not in l2set]

基准测试代码:

import time

l1 = list(range(1000*10 * 3))
l2 = list(range(1000*10 * 2))

l2set = {x for x in l2}

tic = time.time()
l3 = [x for x in l1 if x not in l2set]
toc = time.time()
diffset = toc-tic
print(diffset)

tic = time.time()
l3 = [x for x in l1 if x not in l2]
toc = time.time()
difflist = toc-tic
print(difflist)

print("speedup %fx"%(difflist/diffset))

基准测试结果:

0.0015058517456054688
3.968189239501953
speedup 2635.179227x    

Python有一个称为列表推导式的语言特性,它非常适合使这类事情变得极其简单。下面的语句完全是你想要的,并将结果存储在l3中:

l3 = [x for x in l1 if x not in l2]

L3将包含[1,6]。

性能比较

在Python 3.9.1和Python 2.7.16上比较这里提到的所有答案的性能。

Python 3.9.1

答案按表现顺序列出:

Arkku's set difference using subtraction "-" operation - (91.3 nsec per loop) mquadri$ python3 -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1 - l2" 5000000 loops, best of 5: 91.3 nsec per loop Moinuddin Quadri's using set().difference()- (133 nsec per loop) mquadri$ python3 -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1.difference(l2)" 2000000 loops, best of 5: 133 nsec per loop Moinuddin Quadri's list comprehension with set based lookup- (366 nsec per loop) mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "[x for x in l1 if x not in l2]" 1000000 loops, best of 5: 366 nsec per loop Donut's list comprehension on plain list - (489 nsec per loop) mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "[x for x in l1 if x not in l2]" 500000 loops, best of 5: 489 nsec per loop Daniel Pryden's generator expression with set based lookup and type-casting to list - (583 nsec per loop) : Explicitly type-casting to list to get the final object as list, as requested by OP. If generator expression is replaced with list comprehension, it'll become same as Moinuddin Quadri's list comprehension with set based lookup. mquadri$ mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(x for x in l1 if x not in l2)" 500000 loops, best of 5: 583 nsec per loop Moinuddin Quadri's using filter() and explicitly type-casting to list (need to explicitly type-cast as in Python 3.x, it returns iterator) - (681 nsec per loop) mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(filter(lambda x: x not in l2, l1))" 500000 loops, best of 5: 681 nsec per loop Akshay Hazari's using combination of functools.reduce + filter -(3.36 usec per loop) : Explicitly type-casting to list as from Python 3.x it started returned returning iterator. Also we need to import functools to use reduce in Python 3.x mquadri$ python3 -m timeit "from functools import reduce; l1 = [1,2,6,8]; l2 = [2,3,5,8];" "list(reduce(lambda x,y : filter(lambda z: z!=y,x) ,l1,l2))" 100000 loops, best of 5: 3.36 usec per loop

Python 2.7.16

答案按表现顺序列出:

Arkku's set difference using subtraction "-" operation - (0.0783 usec per loop) mquadri$ python -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1 - l2" 10000000 loops, best of 3: 0.0783 usec per loop Moinuddin Quadri's using set().difference()- (0.117 usec per loop) mquadri$ mquadri$ python -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1.difference(l2)" 10000000 loops, best of 3: 0.117 usec per loop Moinuddin Quadri's list comprehension with set based lookup- (0.246 usec per loop) mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "[x for x in l1 if x not in l2]" 1000000 loops, best of 3: 0.246 usec per loop Donut's list comprehension on plain list - (0.372 usec per loop) mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "[x for x in l1 if x not in l2]" 1000000 loops, best of 3: 0.372 usec per loop Moinuddin Quadri's using filter() - (0.593 usec per loop) mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "filter(lambda x: x not in l2, l1)" 1000000 loops, best of 3: 0.593 usec per loop Daniel Pryden's generator expression with set based lookup and type-casting to list - (0.964 per loop) : Explicitly type-casting to list to get the final object as list, as requested by OP. If generator expression is replaced with list comprehension, it'll become same as Moinuddin Quadri's list comprehension with set based lookup. mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(x for x in l1 if x not in l2)" 1000000 loops, best of 3: 0.964 usec per loop Akshay Hazari's using combination of functools.reduce + filter -(2.78 usec per loop) mquadri$ python -m timeit "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "reduce(lambda x,y : filter(lambda z: z!=y,x) ,l1,l2)" 100000 loops, best of 3: 2.78 usec per loop

使用set.difference ():

你可以使用set.difference()来获取新的set,其中包含集合中不存在于其他集合中的元素。即set(A).difference(B)将返回包含A中存在而B中不存在的项的set。例如:

>>> set([1,2,6,8]).difference([2,3,5,8])
{1, 6}

它是Arkku的答案中提到的一种函数方法(它使用算术减法-运算符来获得集差)。

因为集合是无序的,你将失去初始列表中元素的顺序。(如果你想保持元素的顺序,请继续阅读下一节)

使用列表理解和基于集的查找

如果你想保持从初始列表的顺序,那么甜甜圈的基于列表理解的答案将做的技巧。但是,通过在内部使用set来检查元素是否存在于其他列表中,可以从接受的答案中获得更好的性能。例如:

l1, l2 = [1,2,6,8], [2,3,5,8]
s2 = set(l2)  # Type-cast `l2` to `set`

l3 = [x for x in l1 if x not in s2]
                             #   ^ Doing membership checking on `set` s2

如果你有兴趣了解为什么与列表相比,set的成员资格检查更快,请阅读这篇文章:是什么使得set比list更快?


使用filter()和lambda表达式

下面是使用filter()和lambda表达式的另一种替代方法。在这里添加它只是为了参考,但它不是有效的性能:

>>> l1 = [1,2,6,8]
>>> l2 = set([2,3,5,8])

#     v  `filter` returns the a iterator object. Here I'm type-casting 
#     v  it to `list` in order to display the resultant value
>>> list(filter(lambda x: x not in l2, l1))
[1, 6]