在Python中对正则表达式使用compile有什么好处吗?
h = re.compile('hello')
h.match('hello world')
vs
re.match('hello', 'hello world')
在Python中对正则表达式使用compile有什么好处吗?
h = re.compile('hello')
h.match('hello world')
vs
re.match('hello', 'hello world')
当前回答
有趣的是,编译对我来说确实更有效(Win XP上的Python 2.5.2):
import re
import time
rgx = re.compile('(\w+)\s+[0-9_]?\s+\w*')
str = "average 2 never"
a = 0
t = time.time()
for i in xrange(1000000):
if re.match('(\w+)\s+[0-9_]?\s+\w*', str):
#~ if rgx.match(str):
a += 1
print time.time() - t
按原样运行上述代码一次,并以相反的方式运行两个if行,编译后的正则表达式的速度将提高一倍
其他回答
下面是一个使用re.compile的示例,在请求时速度超过50倍。
这一点与我在上面的评论中所说的是一样的,即当您的使用从编译缓存中获益不多时,使用re.compile可能是一个显著的优势。这种情况至少发生在一个特定的情况下(我在实践中遇到过),即当以下所有情况都成立时:
您有很多regex模式(不仅仅是re._MAXCACHE,它目前的默认值是512),以及 你经常使用这些正则表达式,而且 相同模式的连续使用之间被多个re._MAXCACHE其他正则表达式分隔,因此每个正则表达式在连续使用之间从缓存中刷新。
import re
import time
def setup(N=1000):
# Patterns 'a.*a', 'a.*b', ..., 'z.*z'
patterns = [chr(i) + '.*' + chr(j)
for i in range(ord('a'), ord('z') + 1)
for j in range(ord('a'), ord('z') + 1)]
# If this assertion below fails, just add more (distinct) patterns.
# assert(re._MAXCACHE < len(patterns))
# N strings. Increase N for larger effect.
strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
return (patterns, strings)
def without_compile():
print('Without re.compile:')
patterns, strings = setup()
print('searching')
count = 0
for s in strings:
for pat in patterns:
count += bool(re.search(pat, s))
return count
def without_compile_cache_friendly():
print('Without re.compile, cache-friendly order:')
patterns, strings = setup()
print('searching')
count = 0
for pat in patterns:
for s in strings:
count += bool(re.search(pat, s))
return count
def with_compile():
print('With re.compile:')
patterns, strings = setup()
print('compiling')
compiled = [re.compile(pattern) for pattern in patterns]
print('searching')
count = 0
for s in strings:
for regex in compiled:
count += bool(regex.search(s))
return count
start = time.time()
print(with_compile())
d1 = time.time() - start
print(f'-- That took {d1:.2f} seconds.\n')
start = time.time()
print(without_compile_cache_friendly())
d2 = time.time() - start
print(f'-- That took {d2:.2f} seconds.\n')
start = time.time()
print(without_compile())
d3 = time.time() - start
print(f'-- That took {d3:.2f} seconds.\n')
print(f'Ratio: {d3/d1:.2f}')
我在笔记本电脑上获得的示例输出(Python 3.7.7):
With re.compile:
compiling
searching
676000
-- That took 0.33 seconds.
Without re.compile, cache-friendly order:
searching
676000
-- That took 0.67 seconds.
Without re.compile:
searching
676000
-- That took 23.54 seconds.
Ratio: 70.89
I didn't bother with timeit as the difference is so stark, but I get qualitatively similar numbers each time. Note that even without re.compile, using the same regex multiple times and moving on to the next one wasn't so bad (only about 2 times as slow as with re.compile), but in the other order (looping through many regexes), it is significantly worse, as expected. Also, increasing the cache size works too: simply setting re._MAXCACHE = len(patterns) in setup() above (of course I don't recommend doing such things in production as names with underscores are conventionally “private”) drops the ~23 seconds back down to ~0.7 seconds, which also matches our understanding.
除了表演。
使用compile帮助我区分的概念 1. 模块(re), 2. 正则表达式对象 3.匹配对象 当我开始学习正则表达式的时候
#regex object
regex_object = re.compile(r'[a-zA-Z]+')
#match object
match_object = regex_object.search('1.Hello')
#matching content
match_object.group()
output:
Out[60]: 'Hello'
V.S.
re.search(r'[a-zA-Z]+','1.Hello').group()
Out[61]: 'Hello'
作为补充,我做了一个详尽的备忘单模块re供您参考。
regex = {
'brackets':{'single_character': ['[]', '.', {'negate':'^'}],
'capturing_group' : ['()','(?:)', '(?!)' '|', '\\', 'backreferences and named group'],
'repetition' : ['{}', '*?', '+?', '??', 'greedy v.s. lazy ?']},
'lookaround' :{'lookahead' : ['(?=...)', '(?!...)'],
'lookbehind' : ['(?<=...)','(?<!...)'],
'caputuring' : ['(?P<name>...)', '(?P=name)', '(?:)'],},
'escapes':{'anchor' : ['^', '\b', '$'],
'non_printable' : ['\n', '\t', '\r', '\f', '\v'],
'shorthand' : ['\d', '\w', '\s']},
'methods': {['search', 'match', 'findall', 'finditer'],
['split', 'sub']},
'match_object': ['group','groups', 'groupdict','start', 'end', 'span',]
}
我想说的是,预编译在概念上和“字面上”(如在“文学编程”中)都是有利的。看看这段代码片段:
from re import compile as _Re
class TYPO:
def text_has_foobar( self, text ):
return self._text_has_foobar_re_search( text ) is not None
_text_has_foobar_re_search = _Re( r"""(?i)foobar""" ).search
TYPO = TYPO()
在你的应用程序中,你可以这样写:
from TYPO import TYPO
print( TYPO.text_has_foobar( 'FOObar ) )
this is about as simple in terms of functionality as it can get. because this is example is so short, i conflated the way to get _text_has_foobar_re_search all in one line. the disadvantage of this code is that it occupies a little memory for whatever the lifetime of the TYPO library object is; the advantage is that when doing a foobar search, you'll get away with two function calls and two class dictionary lookups. how many regexes are cached by re and the overhead of that cache are irrelevant here.
将其与更常见的风格进行比较,如下所示:
import re
class Typo:
def text_has_foobar( self, text ):
return re.compile( r"""(?i)foobar""" ).search( text ) is not None
在应用中:
typo = Typo()
print( typo.text_has_foobar( 'FOObar ) )
我很乐意承认我的风格在python中是非常不寻常的,甚至可能是有争议的。然而,在更接近python的使用方式的示例中,为了进行一次匹配,我们必须实例化一个对象,进行三次实例字典查找,并执行三次函数调用;此外,当使用超过100个正则表达式时,我们可能会遇到重新缓存的麻烦。此外,正则表达式被隐藏在方法体中,这在大多数情况下并不是一个好主意。
可以说,每一个措施的子集——有针对性的,别名的import语句;别名方法(如适用);减少函数调用和对象字典查找——可以帮助减少计算和概念的复杂性。
一般来说,我发现在编译模式时使用标志比内联使用标志更容易(至少更容易记住如何使用),比如re.I。
>>> foo_pat = re.compile('foo',re.I)
>>> foo_pat.findall('some string FoO bar')
['FoO']
vs
>>> re.findall('(?i)foo','some string FoO bar')
['FoO']
有趣的是,编译对我来说确实更有效(Win XP上的Python 2.5.2):
import re
import time
rgx = re.compile('(\w+)\s+[0-9_]?\s+\w*')
str = "average 2 never"
a = 0
t = time.time()
for i in xrange(1000000):
if re.match('(\w+)\s+[0-9_]?\s+\w*', str):
#~ if rgx.match(str):
a += 1
print time.time() - t
按原样运行上述代码一次,并以相反的方式运行两个if行,编译后的正则表达式的速度将提高一倍