在Python中对正则表达式使用compile有什么好处吗?
h = re.compile('hello')
h.match('hello world')
vs
re.match('hello', 'hello world')
在Python中对正则表达式使用compile有什么好处吗?
h = re.compile('hello')
h.match('hello world')
vs
re.match('hello', 'hello world')
当前回答
使用re.compile()还有一个额外的好处,即使用re.VERBOSE向正则表达式模式添加注释
pattern = '''
hello[ ]world # Some info on my pattern logic. [ ] to recognize space
'''
re.search(pattern, 'hello world', re.VERBOSE)
虽然这不会影响代码的运行速度,但我喜欢这样做,因为这是我注释习惯的一部分。当我想要修改代码时,我完全不喜欢花时间去记住代码背后的逻辑。
其他回答
在无意中看到这里的讨论之前,我运行了这个测试。然而,在运行它之后,我想我至少会发布我的结果。
我剽窃了Jeff Friedl的“精通正则表达式”中的例子。这是在一台运行OSX 10.6 (2Ghz英特尔酷睿2双核,4GB内存)的macbook上。Python版本为2.6.1。
运行1 -使用re.compile
import re
import time
import fpformat
Regex1 = re.compile('^(a|b|c|d|e|f|g)+$')
Regex2 = re.compile('^[a-g]+$')
TimesToDo = 1000
TestString = ""
for i in range(1000):
TestString += "abababdedfg"
StartTime = time.time()
for i in range(TimesToDo):
Regex1.search(TestString)
Seconds = time.time() - StartTime
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"
StartTime = time.time()
for i in range(TimesToDo):
Regex2.search(TestString)
Seconds = time.time() - StartTime
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"
Alternation takes 2.299 seconds
Character Class takes 0.107 seconds
运行2 -不使用re.compile
import re
import time
import fpformat
TimesToDo = 1000
TestString = ""
for i in range(1000):
TestString += "abababdedfg"
StartTime = time.time()
for i in range(TimesToDo):
re.search('^(a|b|c|d|e|f|g)+$',TestString)
Seconds = time.time() - StartTime
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"
StartTime = time.time()
for i in range(TimesToDo):
re.search('^[a-g]+$',TestString)
Seconds = time.time() - StartTime
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"
Alternation takes 2.508 seconds
Character Class takes 0.109 seconds
使用re.compile()还有一个额外的好处,即使用re.VERBOSE向正则表达式模式添加注释
pattern = '''
hello[ ]world # Some info on my pattern logic. [ ] to recognize space
'''
re.search(pattern, 'hello world', re.VERBOSE)
虽然这不会影响代码的运行速度,但我喜欢这样做,因为这是我注释习惯的一部分。当我想要修改代码时,我完全不喜欢花时间去记住代码背后的逻辑。
除了表演。
使用compile帮助我区分的概念 1. 模块(re), 2. 正则表达式对象 3.匹配对象 当我开始学习正则表达式的时候
#regex object
regex_object = re.compile(r'[a-zA-Z]+')
#match object
match_object = regex_object.search('1.Hello')
#matching content
match_object.group()
output:
Out[60]: 'Hello'
V.S.
re.search(r'[a-zA-Z]+','1.Hello').group()
Out[61]: 'Hello'
作为补充,我做了一个详尽的备忘单模块re供您参考。
regex = {
'brackets':{'single_character': ['[]', '.', {'negate':'^'}],
'capturing_group' : ['()','(?:)', '(?!)' '|', '\\', 'backreferences and named group'],
'repetition' : ['{}', '*?', '+?', '??', 'greedy v.s. lazy ?']},
'lookaround' :{'lookahead' : ['(?=...)', '(?!...)'],
'lookbehind' : ['(?<=...)','(?<!...)'],
'caputuring' : ['(?P<name>...)', '(?P=name)', '(?:)'],},
'escapes':{'anchor' : ['^', '\b', '$'],
'non_printable' : ['\n', '\t', '\r', '\f', '\v'],
'shorthand' : ['\d', '\w', '\s']},
'methods': {['search', 'match', 'findall', 'finditer'],
['split', 'sub']},
'match_object': ['group','groups', 'groupdict','start', 'end', 'span',]
}
我有很多运行一个编译过的正则表达式和实时编译的经验,并没有注意到任何可感知的差异。显然,这只是传闻,当然也不是反对编译的有力论据,但我发现两者之间的差异可以忽略不计。
编辑: 在快速浏览了实际的Python 2.5库代码后,我发现无论何时使用正则表达式(包括调用re.match()), Python都会在内部编译和缓存正则表达式,因此实际上只在正则表达式被编译时进行更改,并且不应该节省太多时间——只节省检查缓存所需的时间(对内部dict类型的键查找)。
来自re.py模块(评论是我的):
def match(pattern, string, flags=0):
return _compile(pattern, flags).match(string)
def _compile(*key):
# Does cache check at top of function
cachekey = (type(key[0]),) + key
p = _cache.get(cachekey)
if p is not None: return p
# ...
# Does actual compilation on cache miss
# ...
# Caches compiled regex
if len(_cache) >= _MAXCACHE:
_cache.clear()
_cache[cachekey] = p
return p
我仍然经常预编译正则表达式,但只是为了将它们绑定到一个漂亮的、可重用的名称,而不是为了任何预期的性能提升。
下面是一个使用re.compile的示例,在请求时速度超过50倍。
这一点与我在上面的评论中所说的是一样的,即当您的使用从编译缓存中获益不多时,使用re.compile可能是一个显著的优势。这种情况至少发生在一个特定的情况下(我在实践中遇到过),即当以下所有情况都成立时:
您有很多regex模式(不仅仅是re._MAXCACHE,它目前的默认值是512),以及 你经常使用这些正则表达式,而且 相同模式的连续使用之间被多个re._MAXCACHE其他正则表达式分隔,因此每个正则表达式在连续使用之间从缓存中刷新。
import re
import time
def setup(N=1000):
# Patterns 'a.*a', 'a.*b', ..., 'z.*z'
patterns = [chr(i) + '.*' + chr(j)
for i in range(ord('a'), ord('z') + 1)
for j in range(ord('a'), ord('z') + 1)]
# If this assertion below fails, just add more (distinct) patterns.
# assert(re._MAXCACHE < len(patterns))
# N strings. Increase N for larger effect.
strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
return (patterns, strings)
def without_compile():
print('Without re.compile:')
patterns, strings = setup()
print('searching')
count = 0
for s in strings:
for pat in patterns:
count += bool(re.search(pat, s))
return count
def without_compile_cache_friendly():
print('Without re.compile, cache-friendly order:')
patterns, strings = setup()
print('searching')
count = 0
for pat in patterns:
for s in strings:
count += bool(re.search(pat, s))
return count
def with_compile():
print('With re.compile:')
patterns, strings = setup()
print('compiling')
compiled = [re.compile(pattern) for pattern in patterns]
print('searching')
count = 0
for s in strings:
for regex in compiled:
count += bool(regex.search(s))
return count
start = time.time()
print(with_compile())
d1 = time.time() - start
print(f'-- That took {d1:.2f} seconds.\n')
start = time.time()
print(without_compile_cache_friendly())
d2 = time.time() - start
print(f'-- That took {d2:.2f} seconds.\n')
start = time.time()
print(without_compile())
d3 = time.time() - start
print(f'-- That took {d3:.2f} seconds.\n')
print(f'Ratio: {d3/d1:.2f}')
我在笔记本电脑上获得的示例输出(Python 3.7.7):
With re.compile:
compiling
searching
676000
-- That took 0.33 seconds.
Without re.compile, cache-friendly order:
searching
676000
-- That took 0.67 seconds.
Without re.compile:
searching
676000
-- That took 23.54 seconds.
Ratio: 70.89
I didn't bother with timeit as the difference is so stark, but I get qualitatively similar numbers each time. Note that even without re.compile, using the same regex multiple times and moving on to the next one wasn't so bad (only about 2 times as slow as with re.compile), but in the other order (looping through many regexes), it is significantly worse, as expected. Also, increasing the cache size works too: simply setting re._MAXCACHE = len(patterns) in setup() above (of course I don't recommend doing such things in production as names with underscores are conventionally “private”) drops the ~23 seconds back down to ~0.7 seconds, which also matches our understanding.