在Python中对正则表达式使用compile有什么好处吗?
h = re.compile('hello')
h.match('hello world')
vs
re.match('hello', 'hello world')
在Python中对正则表达式使用compile有什么好处吗?
h = re.compile('hello')
h.match('hello world')
vs
re.match('hello', 'hello world')
当前回答
下面是一个使用re.compile的示例,在请求时速度超过50倍。
这一点与我在上面的评论中所说的是一样的,即当您的使用从编译缓存中获益不多时,使用re.compile可能是一个显著的优势。这种情况至少发生在一个特定的情况下(我在实践中遇到过),即当以下所有情况都成立时:
您有很多regex模式(不仅仅是re._MAXCACHE,它目前的默认值是512),以及 你经常使用这些正则表达式,而且 相同模式的连续使用之间被多个re._MAXCACHE其他正则表达式分隔,因此每个正则表达式在连续使用之间从缓存中刷新。
import re
import time
def setup(N=1000):
# Patterns 'a.*a', 'a.*b', ..., 'z.*z'
patterns = [chr(i) + '.*' + chr(j)
for i in range(ord('a'), ord('z') + 1)
for j in range(ord('a'), ord('z') + 1)]
# If this assertion below fails, just add more (distinct) patterns.
# assert(re._MAXCACHE < len(patterns))
# N strings. Increase N for larger effect.
strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
return (patterns, strings)
def without_compile():
print('Without re.compile:')
patterns, strings = setup()
print('searching')
count = 0
for s in strings:
for pat in patterns:
count += bool(re.search(pat, s))
return count
def without_compile_cache_friendly():
print('Without re.compile, cache-friendly order:')
patterns, strings = setup()
print('searching')
count = 0
for pat in patterns:
for s in strings:
count += bool(re.search(pat, s))
return count
def with_compile():
print('With re.compile:')
patterns, strings = setup()
print('compiling')
compiled = [re.compile(pattern) for pattern in patterns]
print('searching')
count = 0
for s in strings:
for regex in compiled:
count += bool(regex.search(s))
return count
start = time.time()
print(with_compile())
d1 = time.time() - start
print(f'-- That took {d1:.2f} seconds.\n')
start = time.time()
print(without_compile_cache_friendly())
d2 = time.time() - start
print(f'-- That took {d2:.2f} seconds.\n')
start = time.time()
print(without_compile())
d3 = time.time() - start
print(f'-- That took {d3:.2f} seconds.\n')
print(f'Ratio: {d3/d1:.2f}')
我在笔记本电脑上获得的示例输出(Python 3.7.7):
With re.compile:
compiling
searching
676000
-- That took 0.33 seconds.
Without re.compile, cache-friendly order:
searching
676000
-- That took 0.67 seconds.
Without re.compile:
searching
676000
-- That took 23.54 seconds.
Ratio: 70.89
I didn't bother with timeit as the difference is so stark, but I get qualitatively similar numbers each time. Note that even without re.compile, using the same regex multiple times and moving on to the next one wasn't so bad (only about 2 times as slow as with re.compile), but in the other order (looping through many regexes), it is significantly worse, as expected. Also, increasing the cache size works too: simply setting re._MAXCACHE = len(patterns) in setup() above (of course I don't recommend doing such things in production as names with underscores are conventionally “private”) drops the ~23 seconds back down to ~0.7 seconds, which also matches our understanding.
其他回答
我同意诚实的亚伯,所给例子中的匹配(…)是不同的。他们不是一对一的比较,因此,结果是不同的。为了简化我的回答,我用A, B, C, D来表示这些函数。哦,是的,我们在re.py中处理的是4个函数而不是3个。
运行这段代码:
h = re.compile('hello') # (A)
h.match('hello world') # (B)
与运行此代码相同:
re.match('hello', 'hello world') # (C)
因为,当查看源代码re.py时,(A + B)意味着:
h = re._compile('hello') # (D)
h.match('hello world')
(C)实际上是:
re._compile('hello').match('hello world')
因此,(C)与(B)并不相同,实际上(C)在调用(D)之后调用(B), (D)也被(A)调用,换句话说,(C) = (A) + (B),因此,在循环中比较(A + B)与在循环中比较(C)的结果相同。
George的regexTest.py为我们证明了这一点。
noncompiled took 4.555 seconds. # (C) in a loop
compiledInLoop took 4.620 seconds. # (A + B) in a loop
compiled took 2.323 seconds. # (A) once + (B) in a loop
大家的兴趣是,如何得到2.323秒的结果。为了确保compile(…)只被调用一次,我们需要将编译后的regex对象存储在内存中。如果使用类,则可以存储对象,并在每次调用函数时重用该对象。
class Foo:
regex = re.compile('hello')
def my_function(text)
return regex.match(text)
如果我们不使用类(这是我今天的要求),那么我没有评论。我还在学习如何在Python中使用全局变量,我知道全局变量不是什么好东西。
还有一点,我认为使用(A) + (B)的方法有优势。以下是我观察到的一些事实(如果我错了,请指正):
Calls A once, it will do one search in the _cache followed by one sre_compile.compile() to create a regex object. Calls A twice, it will do two searches and one compile (because the regex object is cached). If the _cache gets flushed in between, then the regex object is released from memory and Python needs to compile again. (someone suggests that Python won't recompile.) If we keep the regex object by using (A), the regex object will still get into _cache and get flushed somehow. But our code keeps a reference on it and the regex object will not be released from memory. Those, Python need not to compile again. The 2 seconds difference in George's test compiled loop vs compiled is mainly the time required to build the key and search the _cache. It doesn't mean the compile time of regex. George's reallycompile test show what happens if it really re-do the compile every time: it will be 100x slower (he reduced the loop from 1,000,000 to 10,000).
以下是(A + B)比(C)更好的情况:
如果可以在类中缓存regex对象的引用。 如果需要重复调用(B)(在循环内或多次),则必须在循环外缓存对regex对象的引用。
如果(C)足够好:
不能缓存引用。 我们只是偶尔用一次。 总的来说,我们没有太多的正则表达式(假设编译后的正则表达式永远不会被刷新)
简单回顾一下,以下是abc:
h = re.compile('hello') # (A)
h.match('hello world') # (B)
re.match('hello', 'hello world') # (C)
感谢阅读。
一般来说,我发现在编译模式时使用标志比内联使用标志更容易(至少更容易记住如何使用),比如re.I。
>>> foo_pat = re.compile('foo',re.I)
>>> foo_pat.findall('some string FoO bar')
['FoO']
vs
>>> re.findall('(?i)foo','some string FoO bar')
['FoO']
对我来说,re.compile的最大好处是能够将正则表达式的定义与其使用分开。
即使是一个简单的表达式,如0|[1-9][0-9]*(以10为基数,不带前导零的整数),也可能非常复杂,以至于您宁愿不重新输入它,检查是否有任何拼写错误,然后在开始调试时重新检查是否有拼写错误。另外,使用像num或num_b10这样的变量名比0|[1-9][0-9]*更好。
当然可以存储字符串并将它们传递给re.match;然而,这就不那么容易读了:
num = "..."
# then, much later:
m = re.match(num, input)
与编译:
num = re.compile("...")
# then, much later:
m = num.match(input)
虽然它很接近,但当重复使用时,第二句的最后一行感觉更自然、更简单。
使用第二个版本时,正则表达式在使用之前会进行编译。如果你要多次执行它,最好先编译它。如果不是每次编译都匹配一次性的是好的。
有趣的是,编译对我来说确实更有效(Win XP上的Python 2.5.2):
import re
import time
rgx = re.compile('(\w+)\s+[0-9_]?\s+\w*')
str = "average 2 never"
a = 0
t = time.time()
for i in xrange(1000000):
if re.match('(\w+)\s+[0-9_]?\s+\w*', str):
#~ if rgx.match(str):
a += 1
print time.time() - t
按原样运行上述代码一次,并以相反的方式运行两个if行,编译后的正则表达式的速度将提高一倍