就Python中的性能而言,列表理解或map(), filter()和reduce()等函数是否比for循环更快?为什么,从技术上讲,它们以C速度运行,而for循环以python虚拟机速度运行?
假设在我正在开发的游戏中,我需要使用for循环绘制复杂而巨大的地图。这个问题肯定是相关的,例如,如果一个列表理解确实更快,为了避免延迟,它将是一个更好的选择(尽管代码的视觉复杂性)。
就Python中的性能而言,列表理解或map(), filter()和reduce()等函数是否比for循环更快?为什么,从技术上讲,它们以C速度运行,而for循环以python虚拟机速度运行?
假设在我正在开发的游戏中,我需要使用for循环绘制复杂而巨大的地图。这个问题肯定是相关的,例如,如果一个列表理解确实更快,为了避免延迟,它将是一个更好的选择(尽管代码的视觉复杂性)。
如果你检查python.org上的信息,你可以看到这个摘要:
Version Time (seconds)
Basic loop 3.47
Eliminate dots 2.45
Local variable & no dots 1.79
Using map function 0.54
但您确实应该详细阅读上面的文章,以了解性能差异的原因。
我还强烈建议您使用timeit来计时。最后,可能会出现这样的情况,例如,当满足条件时,您可能需要跳出for循环。这可能比调用map得到结果更快。
以下是基于经验的粗略指南和有根据的猜测。您应该对具体用例进行计时或分析以获得确切的数字,而这些数字可能偶尔与下面的数字不一致。
列表推导式通常比精确等效的for循环(实际上构建了一个列表)快一点点,很可能是因为它不需要在每次迭代时查找列表及其append方法。然而,列表推导式仍然执行字节码级循环:
>>> dis.dis(<the code object for `[x for x in range(10)]`>)
1 0 BUILD_LIST 0
3 LOAD_FAST 0 (.0)
>> 6 FOR_ITER 12 (to 21)
9 STORE_FAST 1 (x)
12 LOAD_FAST 1 (x)
15 LIST_APPEND 2
18 JUMP_ABSOLUTE 6
>> 21 RETURN_VALUE
使用列表推导式代替不构建列表的循环,无意义地积累一个无意义值的列表,然后将列表丢弃,这通常较慢,因为创建和扩展列表的开销很大。列表推导并不是天生就比旧的循环快的魔法。
As for functional list processing functions: While these are written in C and probably outperform equivalent functions written in Python, they are not necessarily the fastest option. Some speed up is expected if the function is written in C too. But most cases using a lambda (or other Python function), the overhead of repeatedly setting up Python stack frames etc. eats up any savings. Simply doing the same work in-line, without function calls (e.g. a list comprehension instead of map or filter) is often slightly faster.
假设在我正在开发的游戏中,我需要使用for循环绘制复杂而巨大的地图。这个问题肯定是相关的,例如,如果一个列表理解确实更快,为了避免延迟,它将是一个更好的选择(尽管代码的视觉复杂性)。
Chances are, if code like this isn't already fast enough when written in good non-"optimized" Python, no amount of Python level micro optimization is going to make it fast enough and you should start thinking about dropping to C. While extensive micro optimizations can often speed up Python code considerably, there is a low (in absolute terms) limit to this. Moreover, even before you hit that ceiling, it becomes simply more cost efficient (15% speedup vs. 300% speed up with the same effort) to bite the bullet and write some C.
您特别询问了map()、filter()和reduce(),但我假设您想了解一般的函数式编程。我亲自测试了计算一组点中所有点之间距离的问题,函数式编程(使用内置itertools模块中的starmap函数)比for-loops稍微慢一点(实际上需要1.25倍的时间)。下面是我使用的示例代码:
import itertools, time, math, random
class Point:
def __init__(self,x,y):
self.x, self.y = x, y
point_set = (Point(0, 0), Point(0, 1), Point(0, 2), Point(0, 3))
n_points = 100
pick_val = lambda : 10 * random.random() - 5
large_set = [Point(pick_val(), pick_val()) for _ in range(n_points)]
# the distance function
f_dist = lambda x0, x1, y0, y1: math.sqrt((x0 - x1) ** 2 + (y0 - y1) ** 2)
# go through each point, get its distance from all remaining points
f_pos = lambda p1, p2: (p1.x, p2.x, p1.y, p2.y)
extract_dists = lambda x: itertools.starmap(f_dist,
itertools.starmap(f_pos,
itertools.combinations(x, 2)))
print('Distances:', list(extract_dists(point_set)))
t0_f = time.time()
list(extract_dists(large_set))
dt_f = time.time() - t0_f
函数版本是否比过程版本更快?
def extract_dists_procedural(pts):
n_pts = len(pts)
l = []
for k_p1 in range(n_pts - 1):
for k_p2 in range(k_p1, n_pts):
l.append((pts[k_p1].x - pts[k_p2].x) ** 2 +
(pts[k_p1].y - pts[k_p2].y) ** 2)
return l
t0_p = time.time()
list(extract_dists_procedural(large_set))
# using list() on the assumption that
# it eats up as much time as in the functional version
dt_p = time.time() - t0_p
f_vs_p = dt_p / dt_f
if f_vs_p >= 1.0:
print('Time benefit of functional progamming:', f_vs_p,
'times as fast for', n_points, 'points')
else:
print('Time penalty of functional programming:', 1 / f_vs_p,
'times as slow for', n_points, 'points')
我写了一个简单的脚本来测试速度,这是我发现的。实际上在我的例子中,for循环是最快的。这真的让我很惊讶,看看下面(正在计算平方和)。
from functools import reduce
import datetime
def time_it(func, numbers, *args):
start_t = datetime.datetime.now()
for i in range(numbers):
func(args[0])
print (datetime.datetime.now()-start_t)
def square_sum1(numbers):
return reduce(lambda sum, next: sum+next**2, numbers, 0)
def square_sum2(numbers):
a = 0
for i in numbers:
i = i**2
a += i
return a
def square_sum3(numbers):
sqrt = lambda x: x**2
return sum(map(sqrt, numbers))
def square_sum4(numbers):
return(sum([int(i)**2 for i in numbers]))
time_it(square_sum1, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum2, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum3, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum4, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
0:00:00.302000 #Reduce
0:00:00.144000 #For loop
0:00:00.318000 #Map
0:00:00.390000 #List comprehension
在Alphii的答案中添加一个扭曲,实际上for循环是第二好的,大约比map慢6倍
from functools import reduce
import datetime
def time_it(func, numbers, *args):
start_t = datetime.datetime.now()
for i in range(numbers):
func(args[0])
print (datetime.datetime.now()-start_t)
def square_sum1(numbers):
return reduce(lambda sum, next: sum+next**2, numbers, 0)
def square_sum2(numbers):
a = 0
for i in numbers:
a += i**2
return a
def square_sum3(numbers):
a = 0
map(lambda x: a+x**2, numbers)
return a
def square_sum4(numbers):
a = 0
return [a+i**2 for i in numbers]
time_it(square_sum1, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum2, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum3, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum4, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
主要的变化是消除了缓慢的sum调用,以及在最后一种情况下可能不必要的int()。将for循环和map放在同一个术语中,实际上使它变得非常真实。记住lambda是函数概念,理论上不应该有副作用,但是,它们可能有副作用,比如添加到。 在这种情况下,使用Python 3.6.1, Ubuntu 14.04, Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
0:00:00.257703 #Reduce
0:00:00.184898 #For loop
0:00:00.031718 #Map
0:00:00.212699 #List comprehension
我已经设法修改了一些@alpiii的代码,并发现List理解比for循环快一点。它可能是由int()引起的,在列表理解和for循环之间是不公平的。
from functools import reduce
import datetime
def time_it(func, numbers, *args):
start_t = datetime.datetime.now()
for i in range(numbers):
func(args[0])
print (datetime.datetime.now()-start_t)
def square_sum1(numbers):
return reduce(lambda sum, next: sum+next*next, numbers, 0)
def square_sum2(numbers):
a = []
for i in numbers:
a.append(i*2)
a = sum(a)
return a
def square_sum3(numbers):
sqrt = lambda x: x*x
return sum(map(sqrt, numbers))
def square_sum4(numbers):
return(sum([i*i for i in numbers]))
time_it(square_sum1, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum2, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum3, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum4, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
0:00:00.101122 #Reduce
0:00:00.089216 #For loop
0:00:00.101532 #Map
0:00:00.068916 #List comprehension
我修改了@Alisa的代码,并使用cProfile来展示为什么列表理解更快:
from functools import reduce
import datetime
def reduce_(numbers):
return reduce(lambda sum, next: sum + next * next, numbers, 0)
def for_loop(numbers):
a = []
for i in numbers:
a.append(i*2)
a = sum(a)
return a
def map_(numbers):
sqrt = lambda x: x*x
return sum(map(sqrt, numbers))
def list_comp(numbers):
return(sum([i*i for i in numbers]))
funcs = [
reduce_,
for_loop,
map_,
list_comp
]
if __name__ == "__main__":
# [1, 2, 5, 3, 1, 2, 5, 3]
import cProfile
for f in funcs:
print('=' * 25)
print("Profiling:", f.__name__)
print('=' * 25)
pr = cProfile.Profile()
for i in range(10**6):
pr.runcall(f, [1, 2, 5, 3, 1, 2, 5, 3])
pr.create_stats()
pr.print_stats()
结果如下:
=========================
Profiling: reduce_
=========================
11000000 function calls in 1.501 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1000000 0.162 0.000 1.473 0.000 profiling.py:4(reduce_)
8000000 0.461 0.000 0.461 0.000 profiling.py:5(<lambda>)
1000000 0.850 0.000 1.311 0.000 {built-in method _functools.reduce}
1000000 0.028 0.000 0.028 0.000 {method 'disable' of '_lsprof.Profiler' objects}
=========================
Profiling: for_loop
=========================
11000000 function calls in 1.372 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1000000 0.879 0.000 1.344 0.000 profiling.py:7(for_loop)
1000000 0.145 0.000 0.145 0.000 {built-in method builtins.sum}
8000000 0.320 0.000 0.320 0.000 {method 'append' of 'list' objects}
1000000 0.027 0.000 0.027 0.000 {method 'disable' of '_lsprof.Profiler' objects}
=========================
Profiling: map_
=========================
11000000 function calls in 1.470 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1000000 0.264 0.000 1.442 0.000 profiling.py:14(map_)
8000000 0.387 0.000 0.387 0.000 profiling.py:15(<lambda>)
1000000 0.791 0.000 1.178 0.000 {built-in method builtins.sum}
1000000 0.028 0.000 0.028 0.000 {method 'disable' of '_lsprof.Profiler' objects}
=========================
Profiling: list_comp
=========================
4000000 function calls in 0.737 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1000000 0.318 0.000 0.709 0.000 profiling.py:18(list_comp)
1000000 0.261 0.000 0.261 0.000 profiling.py:19(<listcomp>)
1000000 0.131 0.000 0.131 0.000 {built-in method builtins.sum}
1000000 0.027 0.000 0.027 0.000 {method 'disable' of '_lsprof.Profiler' objects}
恕我直言:
Reduce和map通常都很慢。不仅如此,在map返回的迭代器上使用sum函数比在列表中使用sum函数要慢 For_loop使用append,这在某种程度上当然比较慢 列表理解不仅花费最少的时间构建列表,而且与map相比,它还使sum更快
我正在寻找一些关于“for”循环和“列表理解”的性能信息,偶然发现了这个主题。 距离Python 3.11发布(2022年10月)已经过去了几个月,Python 3.11的主要特性之一是速度的提高。 https://www.python.org/downloads/release/python-3110/
更快的CPython项目已经产生了一些令人兴奋的结果。Python 3.11比Python 3.10快10-60%。平均而言,我们在标准基准测试套件上测试到1.22倍的加速。详见Faster CPython。
我运行了最初由Alphi发布,然后被jjmerelo“扭曲”的相同代码。Python3.10和Python3.11的结果如下:
from functools import reduce
import datetime
def time_it(func, numbers, *args):
start_t = datetime.datetime.now()
for i in range(numbers):
func(args[0])
print(datetime.datetime.now()-start_t)
def square_sum1(numbers):
return reduce(lambda sum, next: sum+next**2, numbers, 0)
def square_sum2(numbers):
a = 0
for i in numbers:
a += i**2
return a
def square_sum3(numbers):
a = 0
map(lambda x: a+x**2, numbers)
return a
def square_sum4(numbers):
a = 0
return [a+i**2 for i in numbers]
time_it(square_sum1, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum2, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum3, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum4, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
我还没有计算出确切的改进百分比,但很明显,性能增益-至少在这个特定的实例中-似乎是令人印象深刻的(快3到4倍),除了'map',它的性能改进可以忽略不计。
#Python 3.10
0:00:00.221134 #Reduce
0:00:00.186307 #For
0:00:00.024311 #Map
0:00:00.206454 #List comprehension
#python3.11
0:00:00.072550 #Reduce
0:00:00.037168 #For
0:00:00.021702 #Map
0:00:00.058655 #List Comprehension
注意:我使用WSL在运行于Windows 11下的Kali Linux虚拟机上运行此程序。我不确定这段代码在Linux实例上本机(裸机)运行是否会执行得更好。
我的Kali Linux虚拟机规格如下:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-6700T CPU @ 2.80GHz
CPU family: 6
Model: 94
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 3
BogoMIPS: 5615.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi ept vpid ept_ad fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 128 KiB (4 instances)
L1i cache: 128 KiB (4 instances)
L2 cache: 1 MiB (4 instances)
L3 cache: 8 MiB (1 instance)
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Unknown: Dependent on hypervisor status
Vulnerability Tsx async abort: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown