假设我有两个表,l1和l2。我想执行l1 - l2,返回l1中不在l2中的所有元素。
我可以想出一个简单的循环方法来做这个,但那真的很低效。python式的高效方法是什么?
举个例子,如果l1 = [1,2,6,8], l2 = [2,3,5,8], l1 - l2应该返回[1,6]
假设我有两个表,l1和l2。我想执行l1 - l2,返回l1中不在l2中的所有元素。
我可以想出一个简单的循环方法来做这个,但那真的很低效。python式的高效方法是什么?
举个例子,如果l1 = [1,2,6,8], l2 = [2,3,5,8], l1 - l2应该返回[1,6]
当前回答
备选方案:
reduce(lambda x,y : filter(lambda z: z!=y,x) ,[2,3,5,8],[1,2,6,8])
其他回答
Python 3.8上的集合和列表理解基准
(加起来就是Moinuddin Quadri的基准)
tldr:使用Arkku的集合解决方案,相比之下比承诺的还要快!
根据列表检查现有文件
在我的例子中,我发现使用Arkku的集合解决方案要比python的列表理解方法快40倍(!)。
列表理解:
%%time
import glob
existing = [int(os.path.basename(x).split(".")[0]) for x in glob.glob("*.txt")]
wanted = list(range(1, 100000))
[i for i in wanted if i not in existing]
墙壁时间:28.2秒
Sets
%%time
import glob
existing = [int(os.path.basename(x).split(".")[0]) for x in glob.glob("*.txt")]
wanted = list(range(1, 100000))
set(wanted) - set(existing)
壁时间:689毫秒
使用Set推导式{x for x in l2}或Set (l2)来获取Set,然后使用List推导式来获取List
l2set = set(l2)
l3 = [x for x in l1 if x not in l2set]
基准测试代码:
import time
l1 = list(range(1000*10 * 3))
l2 = list(range(1000*10 * 2))
l2set = {x for x in l2}
tic = time.time()
l3 = [x for x in l1 if x not in l2set]
toc = time.time()
diffset = toc-tic
print(diffset)
tic = time.time()
l3 = [x for x in l1 if x not in l2]
toc = time.time()
difflist = toc-tic
print(difflist)
print("speedup %fx"%(difflist/diffset))
基准测试结果:
0.0015058517456054688
3.968189239501953
speedup 2635.179227x
使用filterfalse而不使用lambda-expression
When using functions like filter or filterfalse and similar from itertools you can usually save performance by avoiding lambda-expressions and using already existing functions. Instances of list and set defines a __contains__-method to use for containment checks. The in-operator calls this method under the hood, so using x in l2 can be replaced by l2.__contains__(x). Usually this replacement is not really prettier but in this specific case it allows us to gain better performance than using a lambda-expression, when used in combination with filterfalse:
>>> from itertools import filterfalse
>>> l1 = [1, 2, 6, 8]
>>> l2 = [2, 3, 5, 8]
>>> list(filterfalse(l2.__contains__, l1))
[1, 6]
Filterfalse创建一个迭代器,该迭代器生成的所有元素在用作12.2 .__contains__的参数时返回false。
Sets有一个更快的__contains__实现,所以更好的是:
>>> from itertools import filterfalse
>>> l1 = [1, 2, 6, 8]
>>> l2 = set([2, 3, 5, 8])
>>> list(filterfalse(l2.__contains__, l1))
[1, 6]
性能
使用列表:
$ python3 -m timeit -s "from itertools import filterfalse; l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(filterfalse(l2.__contains__, l1))"
500000 loops, best of 5: 522 nsec per loop
使用设置:
$ python3 -m timeit -s "from itertools import filterfalse; l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(filterfalse(l2.__contains__, l1))"
1000000 loops, best of 5: 359 nsec per loop
性能比较
在Python 3.9.1和Python 2.7.16上比较这里提到的所有答案的性能。
Python 3.9.1
答案按表现顺序列出:
Arkku's set difference using subtraction "-" operation - (91.3 nsec per loop) mquadri$ python3 -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1 - l2" 5000000 loops, best of 5: 91.3 nsec per loop Moinuddin Quadri's using set().difference()- (133 nsec per loop) mquadri$ python3 -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1.difference(l2)" 2000000 loops, best of 5: 133 nsec per loop Moinuddin Quadri's list comprehension with set based lookup- (366 nsec per loop) mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "[x for x in l1 if x not in l2]" 1000000 loops, best of 5: 366 nsec per loop Donut's list comprehension on plain list - (489 nsec per loop) mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "[x for x in l1 if x not in l2]" 500000 loops, best of 5: 489 nsec per loop Daniel Pryden's generator expression with set based lookup and type-casting to list - (583 nsec per loop) : Explicitly type-casting to list to get the final object as list, as requested by OP. If generator expression is replaced with list comprehension, it'll become same as Moinuddin Quadri's list comprehension with set based lookup. mquadri$ mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(x for x in l1 if x not in l2)" 500000 loops, best of 5: 583 nsec per loop Moinuddin Quadri's using filter() and explicitly type-casting to list (need to explicitly type-cast as in Python 3.x, it returns iterator) - (681 nsec per loop) mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(filter(lambda x: x not in l2, l1))" 500000 loops, best of 5: 681 nsec per loop Akshay Hazari's using combination of functools.reduce + filter -(3.36 usec per loop) : Explicitly type-casting to list as from Python 3.x it started returned returning iterator. Also we need to import functools to use reduce in Python 3.x mquadri$ python3 -m timeit "from functools import reduce; l1 = [1,2,6,8]; l2 = [2,3,5,8];" "list(reduce(lambda x,y : filter(lambda z: z!=y,x) ,l1,l2))" 100000 loops, best of 5: 3.36 usec per loop
Python 2.7.16
答案按表现顺序列出:
Arkku's set difference using subtraction "-" operation - (0.0783 usec per loop) mquadri$ python -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1 - l2" 10000000 loops, best of 3: 0.0783 usec per loop Moinuddin Quadri's using set().difference()- (0.117 usec per loop) mquadri$ mquadri$ python -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1.difference(l2)" 10000000 loops, best of 3: 0.117 usec per loop Moinuddin Quadri's list comprehension with set based lookup- (0.246 usec per loop) mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "[x for x in l1 if x not in l2]" 1000000 loops, best of 3: 0.246 usec per loop Donut's list comprehension on plain list - (0.372 usec per loop) mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "[x for x in l1 if x not in l2]" 1000000 loops, best of 3: 0.372 usec per loop Moinuddin Quadri's using filter() - (0.593 usec per loop) mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "filter(lambda x: x not in l2, l1)" 1000000 loops, best of 3: 0.593 usec per loop Daniel Pryden's generator expression with set based lookup and type-casting to list - (0.964 per loop) : Explicitly type-casting to list to get the final object as list, as requested by OP. If generator expression is replaced with list comprehension, it'll become same as Moinuddin Quadri's list comprehension with set based lookup. mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(x for x in l1 if x not in l2)" 1000000 loops, best of 3: 0.964 usec per loop Akshay Hazari's using combination of functools.reduce + filter -(2.78 usec per loop) mquadri$ python -m timeit "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "reduce(lambda x,y : filter(lambda z: z!=y,x) ,l1,l2)" 100000 loops, best of 3: 2.78 usec per loop
备选方案:
reduce(lambda x,y : filter(lambda z: z!=y,x) ,[2,3,5,8],[1,2,6,8])