在Python中对正则表达式使用compile有什么好处吗?
h = re.compile('hello')
h.match('hello world')
vs
re.match('hello', 'hello world')
在Python中对正则表达式使用compile有什么好处吗?
h = re.compile('hello')
h.match('hello world')
vs
re.match('hello', 'hello world')
当前回答
Ubuntu 22.04:
$ python --version
Python 3.10.6
$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 're.match("[0-9]{3}-[0-9]{3}-[0-9]{4}", "123-123-1234")'; done
1 loop, best of 5: 972 nsec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (186 usec) was more than four times slower than the best time (972 nsec).
10 loops, best of 5: 819 nsec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (13.9 usec) was more than four times slower than the best time (819 nsec).
100 loops, best of 5: 763 nsec per loop
1000 loops, best of 5: 699 nsec per loop
10000 loops, best of 5: 653 nsec per loop
100000 loops, best of 5: 655 nsec per loop
1000000 loops, best of 5: 656 nsec per loop
$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 'r = re.compile("[0-9]{3}-[0-9]{3}-[0-9]{4}")' 'r.match("123-123-1234")'; done
1 loop, best of 5: 985 nsec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (134 usec) was more than four times slower than the best time (985 nsec).
10 loops, best of 5: 775 nsec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (13.9 usec) was more than four times slower than the best time (775 nsec).
100 loops, best of 5: 756 nsec per loop
1000 loops, best of 5: 701 nsec per loop
10000 loops, best of 5: 704 nsec per loop
100000 loops, best of 5: 654 nsec per loop
1000000 loops, best of 5: 651 nsec per loop
其他回答
尽管这两种方法在速度方面是可以比较的,但是您应该知道,如果您正在处理数百万次迭代,那么仍然存在一些可以忽略不计的时间差。
以下速度测试:
import re
import time
SIZE = 100_000_000
start = time.time()
foo = re.compile('foo')
[foo.search('bar') for _ in range(SIZE)]
print('compiled: ', time.time() - start)
start = time.time()
[re.search('foo', 'bar') for _ in range(SIZE)]
print('uncompiled:', time.time() - start)
给出了以下结果:
compiled: 14.647532224655151
uncompiled: 61.483458042144775
编译后的方法在我的PC上(使用Python 3.7.0)始终快大约4倍。
如文档中所述:
如果在循环中访问正则表达式,预编译它将节省一些函数调用。在循环之外,由于内部缓存,没有太大区别。
对我来说,re.compile的最大好处是能够将正则表达式的定义与其使用分开。
即使是一个简单的表达式,如0|[1-9][0-9]*(以10为基数,不带前导零的整数),也可能非常复杂,以至于您宁愿不重新输入它,检查是否有任何拼写错误,然后在开始调试时重新检查是否有拼写错误。另外,使用像num或num_b10这样的变量名比0|[1-9][0-9]*更好。
当然可以存储字符串并将它们传递给re.match;然而,这就不那么容易读了:
num = "..."
# then, much later:
m = re.match(num, input)
与编译:
num = re.compile("...")
# then, much later:
m = num.match(input)
虽然它很接近,但当重复使用时,第二句的最后一行感觉更自然、更简单。
我有很多运行一个编译过的正则表达式和实时编译的经验,并没有注意到任何可感知的差异。显然,这只是传闻,当然也不是反对编译的有力论据,但我发现两者之间的差异可以忽略不计。
编辑: 在快速浏览了实际的Python 2.5库代码后,我发现无论何时使用正则表达式(包括调用re.match()), Python都会在内部编译和缓存正则表达式,因此实际上只在正则表达式被编译时进行更改,并且不应该节省太多时间——只节省检查缓存所需的时间(对内部dict类型的键查找)。
来自re.py模块(评论是我的):
def match(pattern, string, flags=0):
return _compile(pattern, flags).match(string)
def _compile(*key):
# Does cache check at top of function
cachekey = (type(key[0]),) + key
p = _cache.get(cachekey)
if p is not None: return p
# ...
# Does actual compilation on cache miss
# ...
# Caches compiled regex
if len(_cache) >= _MAXCACHE:
_cache.clear()
_cache[cachekey] = p
return p
我仍然经常预编译正则表达式,但只是为了将它们绑定到一个漂亮的、可重用的名称,而不是为了任何预期的性能提升。
大多数情况下,是否使用re.compile没有什么区别。在内部,所有函数都是按照编译步骤实现的:
def match(pattern, string, flags=0):
return _compile(pattern, flags).match(string)
def fullmatch(pattern, string, flags=0):
return _compile(pattern, flags).fullmatch(string)
def search(pattern, string, flags=0):
return _compile(pattern, flags).search(string)
def sub(pattern, repl, string, count=0, flags=0):
return _compile(pattern, flags).sub(repl, string, count)
def subn(pattern, repl, string, count=0, flags=0):
return _compile(pattern, flags).subn(repl, string, count)
def split(pattern, string, maxsplit=0, flags=0):
return _compile(pattern, flags).split(string, maxsplit)
def findall(pattern, string, flags=0):
return _compile(pattern, flags).findall(string)
def finditer(pattern, string, flags=0):
return _compile(pattern, flags).finditer(string)
此外,re.compile()绕过了额外的间接和缓存逻辑:
_cache = {}
_pattern_type = type(sre_compile.compile("", 0))
_MAXCACHE = 512
def _compile(pattern, flags):
# internal: compile pattern
try:
p, loc = _cache[type(pattern), pattern, flags]
if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
return p
except KeyError:
pass
if isinstance(pattern, _pattern_type):
if flags:
raise ValueError(
"cannot process flags argument with a compiled pattern")
return pattern
if not sre_compile.isstring(pattern):
raise TypeError("first argument must be string or compiled pattern")
p = sre_compile.compile(pattern, flags)
if not (flags & DEBUG):
if len(_cache) >= _MAXCACHE:
_cache.clear()
if p.flags & LOCALE:
if not _locale:
return p
loc = _locale.setlocale(_locale.LC_CTYPE)
else:
loc = None
_cache[type(pattern), pattern, flags] = p, loc
return p
除了使用re.compile带来的小速度好处外,人们还喜欢命名潜在复杂的模式规范并将其与应用的业务逻辑分离所带来的可读性:
#### Patterns ############################################################
number_pattern = re.compile(r'\d+(\.\d*)?') # Integer or decimal number
assign_pattern = re.compile(r':=') # Assignment operator
identifier_pattern = re.compile(r'[A-Za-z]+') # Identifiers
whitespace_pattern = re.compile(r'[\t ]+') # Spaces and tabs
#### Applications ########################################################
if whitespace_pattern.match(s): business_logic_rule_1()
if assign_pattern.match(s): business_logic_rule_2()
注意,另一位受访者错误地认为pyc文件直接存储已编译的模式;然而,在现实中,每次PYC加载时,它们都会被重新构建:
>>> from dis import dis
>>> with open('tmp.pyc', 'rb') as f:
f.read(8)
dis(marshal.load(f))
1 0 LOAD_CONST 0 (-1)
3 LOAD_CONST 1 (None)
6 IMPORT_NAME 0 (re)
9 STORE_NAME 0 (re)
3 12 LOAD_NAME 0 (re)
15 LOAD_ATTR 1 (compile)
18 LOAD_CONST 2 ('[aeiou]{2,5}')
21 CALL_FUNCTION 1
24 STORE_NAME 2 (lc_vowels)
27 LOAD_CONST 1 (None)
30 RETURN_VALUE
上面的分解来自于一个包含tmp.py的PYC文件:
import re
lc_vowels = re.compile(r'[aeiou]{2,5}')
我自己刚试过。对于从字符串中解析数字并对其求和的简单情况,使用编译后的正则表达式对象的速度大约是使用re方法的两倍。
正如其他人指出的那样,re方法(包括re.compile)在以前编译的表达式缓存中查找正则表达式字符串。因此,在正常情况下,使用re方法的额外成本只是缓存查找的成本。
然而,检查代码,缓存被限制为100个表达式。这就引出了一个问题,缓存溢出有多痛苦?该代码包含正则表达式编译器的内部接口re.sre_compile.compile。如果我们调用它,就绕过了缓存。结果表明,对于一个基本的正则表达式,例如r'\w+\s+([0-9_]+)\s+\w*',它要慢两个数量级。
下面是我的测试:
#!/usr/bin/env python
import re
import time
def timed(func):
def wrapper(*args):
t = time.time()
result = func(*args)
t = time.time() - t
print '%s took %.3f seconds.' % (func.func_name, t)
return result
return wrapper
regularExpression = r'\w+\s+([0-9_]+)\s+\w*'
testString = "average 2 never"
@timed
def noncompiled():
a = 0
for x in xrange(1000000):
m = re.match(regularExpression, testString)
a += int(m.group(1))
return a
@timed
def compiled():
a = 0
rgx = re.compile(regularExpression)
for x in xrange(1000000):
m = rgx.match(testString)
a += int(m.group(1))
return a
@timed
def reallyCompiled():
a = 0
rgx = re.sre_compile.compile(regularExpression)
for x in xrange(1000000):
m = rgx.match(testString)
a += int(m.group(1))
return a
@timed
def compiledInLoop():
a = 0
for x in xrange(1000000):
rgx = re.compile(regularExpression)
m = rgx.match(testString)
a += int(m.group(1))
return a
@timed
def reallyCompiledInLoop():
a = 0
for x in xrange(10000):
rgx = re.sre_compile.compile(regularExpression)
m = rgx.match(testString)
a += int(m.group(1))
return a
r1 = noncompiled()
r2 = compiled()
r3 = reallyCompiled()
r4 = compiledInLoop()
r5 = reallyCompiledInLoop()
print "r1 = ", r1
print "r2 = ", r2
print "r3 = ", r3
print "r4 = ", r4
print "r5 = ", r5
</pre>
And here is the output on my machine:
<pre>
$ regexTest.py
noncompiled took 4.555 seconds.
compiled took 2.323 seconds.
reallyCompiled took 2.325 seconds.
compiledInLoop took 4.620 seconds.
reallyCompiledInLoop took 4.074 seconds.
r1 = 2000000
r2 = 2000000
r3 = 2000000
r4 = 2000000
r5 = 20000
'reallyCompiled'方法使用内部接口,绕过缓存。注意,在每个循环迭代中编译的代码只迭代了10,000次,而不是一百万次。