在Python中对正则表达式使用compile有什么好处吗?

h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')

当前回答

在无意中看到这里的讨论之前,我运行了这个测试。然而,在运行它之后,我想我至少会发布我的结果。

我剽窃了Jeff Friedl的“精通正则表达式”中的例子。这是在一台运行OSX 10.6 (2Ghz英特尔酷睿2双核,4GB内存)的macbook上。Python版本为2.6.1。

运行1 -使用re.compile

import re 
import time 
import fpformat
Regex1 = re.compile('^(a|b|c|d|e|f|g)+$') 
Regex2 = re.compile('^[a-g]+$')
TimesToDo = 1000
TestString = "" 
for i in range(1000):
    TestString += "abababdedfg"
StartTime = time.time() 
for i in range(TimesToDo):
    Regex1.search(TestString) 
Seconds = time.time() - StartTime 
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

StartTime = time.time() 
for i in range(TimesToDo):
    Regex2.search(TestString) 
Seconds = time.time() - StartTime 
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

Alternation takes 2.299 seconds
Character Class takes 0.107 seconds

运行2 -不使用re.compile

import re 
import time 
import fpformat

TimesToDo = 1000
TestString = "" 
for i in range(1000):
    TestString += "abababdedfg"
StartTime = time.time() 
for i in range(TimesToDo):
    re.search('^(a|b|c|d|e|f|g)+$',TestString) 
Seconds = time.time() - StartTime 
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

StartTime = time.time() 
for i in range(TimesToDo):
    re.search('^[a-g]+$',TestString) 
Seconds = time.time() - StartTime 
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

Alternation takes 2.508 seconds
Character Class takes 0.109 seconds

其他回答

我的理解是,这两个例子实际上是等价的。唯一的区别是,在第一种情况下,您可以在其他地方重用已编译的正则表达式,而不会导致再次编译它。

这里有一个参考:http://diveintopython3.ep.io/refactoring.html

使用字符串'M'调用已编译模式对象的搜索函数,其效果与同时使用正则表达式和字符串'M'调用re.search相同。只是要快得多。(事实上,re.search函数只是编译正则表达式,并为您调用结果模式对象的搜索方法。)

FWIW:

$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop

$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop

因此,如果您将经常使用同一个正则表达式,可能值得执行re.compile(特别是对于更复杂的正则表达式)。

反对过早优化的标准论点适用,但如果您怀疑regexp可能成为性能瓶颈,我不认为使用re.compile会真正失去多少清晰度/直接性。

更新:

在Python 3.6(我怀疑上述计时是使用Python 2.x完成的)和2018硬件(MacBook Pro)下,我现在得到以下计时:

% python -m timeit -s "import re" "re.match('hello', 'hello world')"
1000000 loops, best of 3: 0.661 usec per loop

% python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 0.285 usec per loop

% python -m timeit -s "import re" "h=re.compile('hello'); h.match('hello world')"
1000000 loops, best of 3: 0.65 usec per loop

% python --version
Python 3.6.5 :: Anaconda, Inc.

我还添加了一个案例(注意最后两次运行之间的引号差异),表明re.match(x,…)从字面上[大致]等价于re.compile(x).match(…),即似乎没有发生编译表示的幕后缓存。

下面是一个使用re.compile的示例,在请求时速度超过50倍。

这一点与我在上面的评论中所说的是一样的,即当您的使用从编译缓存中获益不多时,使用re.compile可能是一个显著的优势。这种情况至少发生在一个特定的情况下(我在实践中遇到过),即当以下所有情况都成立时:

您有很多regex模式(不仅仅是re._MAXCACHE,它目前的默认值是512),以及 你经常使用这些正则表达式,而且 相同模式的连续使用之间被多个re._MAXCACHE其他正则表达式分隔,因此每个正则表达式在连续使用之间从缓存中刷新。

import re
import time

def setup(N=1000):
    # Patterns 'a.*a', 'a.*b', ..., 'z.*z'
    patterns = [chr(i) + '.*' + chr(j)
                    for i in range(ord('a'), ord('z') + 1)
                    for j in range(ord('a'), ord('z') + 1)]
    # If this assertion below fails, just add more (distinct) patterns.
    # assert(re._MAXCACHE < len(patterns))
    # N strings. Increase N for larger effect.
    strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
    return (patterns, strings)

def without_compile():
    print('Without re.compile:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for s in strings:
        for pat in patterns:
            count += bool(re.search(pat, s))
    return count

def without_compile_cache_friendly():
    print('Without re.compile, cache-friendly order:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for pat in patterns:
        for s in strings:
            count += bool(re.search(pat, s))
    return count

def with_compile():
    print('With re.compile:')
    patterns, strings = setup()
    print('compiling')
    compiled = [re.compile(pattern) for pattern in patterns]
    print('searching')
    count = 0
    for s in strings:
        for regex in compiled:
            count += bool(regex.search(s))
    return count

start = time.time()
print(with_compile())
d1 = time.time() - start
print(f'-- That took {d1:.2f} seconds.\n')

start = time.time()
print(without_compile_cache_friendly())
d2 = time.time() - start
print(f'-- That took {d2:.2f} seconds.\n')

start = time.time()
print(without_compile())
d3 = time.time() - start
print(f'-- That took {d3:.2f} seconds.\n')

print(f'Ratio: {d3/d1:.2f}')

我在笔记本电脑上获得的示例输出(Python 3.7.7):

With re.compile:
compiling
searching
676000
-- That took 0.33 seconds.

Without re.compile, cache-friendly order:
searching
676000
-- That took 0.67 seconds.

Without re.compile:
searching
676000
-- That took 23.54 seconds.

Ratio: 70.89

I didn't bother with timeit as the difference is so stark, but I get qualitatively similar numbers each time. Note that even without re.compile, using the same regex multiple times and moving on to the next one wasn't so bad (only about 2 times as slow as with re.compile), but in the other order (looping through many regexes), it is significantly worse, as expected. Also, increasing the cache size works too: simply setting re._MAXCACHE = len(patterns) in setup() above (of course I don't recommend doing such things in production as names with underscores are conventionally “private”) drops the ~23 seconds back down to ~0.7 seconds, which also matches our understanding.

有趣的是,编译对我来说确实更有效(Win XP上的Python 2.5.2):

import re
import time

rgx = re.compile('(\w+)\s+[0-9_]?\s+\w*')
str = "average    2 never"
a = 0

t = time.time()

for i in xrange(1000000):
    if re.match('(\w+)\s+[0-9_]?\s+\w*', str):
    #~ if rgx.match(str):
        a += 1

print time.time() - t

按原样运行上述代码一次,并以相反的方式运行两个if行,编译后的正则表达式的速度将提高一倍

我有很多运行一个编译过的正则表达式和实时编译的经验,并没有注意到任何可感知的差异。显然,这只是传闻,当然也不是反对编译的有力论据,但我发现两者之间的差异可以忽略不计。

编辑: 在快速浏览了实际的Python 2.5库代码后,我发现无论何时使用正则表达式(包括调用re.match()), Python都会在内部编译和缓存正则表达式,因此实际上只在正则表达式被编译时进行更改,并且不应该节省太多时间——只节省检查缓存所需的时间(对内部dict类型的键查找)。

来自re.py模块(评论是我的):

def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def _compile(*key):

    # Does cache check at top of function
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None: return p

    # ...
    # Does actual compilation on cache miss
    # ...

    # Caches compiled regex
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    _cache[cachekey] = p
    return p

我仍然经常预编译正则表达式,但只是为了将它们绑定到一个漂亮的、可重用的名称,而不是为了任何预期的性能提升。