在Python中对正则表达式使用compile有什么好处吗?

h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')

当前回答

这是个好问题。你经常看到人们毫无理由地使用re.compile。它降低了可读性。但是可以肯定的是,很多时候需要预编译表达式。就像你在循环中重复使用它一样。

这就像编程的一切(实际上是生活中的一切)。运用常识。

其他回答

有趣的是,编译对我来说确实更有效(Win XP上的Python 2.5.2):

import re
import time

rgx = re.compile('(\w+)\s+[0-9_]?\s+\w*')
str = "average    2 never"
a = 0

t = time.time()

for i in xrange(1000000):
    if re.match('(\w+)\s+[0-9_]?\s+\w*', str):
    #~ if rgx.match(str):
        a += 1

print time.time() - t

按原样运行上述代码一次,并以相反的方式运行两个if行,编译后的正则表达式的速度将提高一倍

作为一个替代答案,正如我看到之前没有提到的,我将继续引用Python 3文档:

您是应该使用这些模块级函数,还是应该获取模式并自己调用它的方法?如果在循环中访问正则表达式,预编译它将节省一些函数调用。在循环之外,由于内部缓存,没有太大区别。

(几个月后)很容易在re.match周围添加自己的缓存, 或者其他任何事情——

""" Re.py: Re.match = re.match + cache  
    efficiency: re.py does this already (but what's _MAXCACHE ?)
    readability, inline / separate: matter of taste
"""

import re

cache = {}
_re_type = type( re.compile( "" ))

def match( pattern, str, *opt ):
    """ Re.match = re.match + cache re.compile( pattern ) 
    """
    if type(pattern) == _re_type:
        cpat = pattern
    elif pattern in cache:
        cpat = cache[pattern]
    else:
        cpat = cache[pattern] = re.compile( pattern, *opt )
    return cpat.match( str )

# def search ...

一个wibni,如果:cachehint(size=), cacheinfo() -> size, hits, nclear…

下面是一个使用re.compile的示例,在请求时速度超过50倍。

这一点与我在上面的评论中所说的是一样的,即当您的使用从编译缓存中获益不多时,使用re.compile可能是一个显著的优势。这种情况至少发生在一个特定的情况下(我在实践中遇到过),即当以下所有情况都成立时:

您有很多regex模式(不仅仅是re._MAXCACHE,它目前的默认值是512),以及 你经常使用这些正则表达式,而且 相同模式的连续使用之间被多个re._MAXCACHE其他正则表达式分隔,因此每个正则表达式在连续使用之间从缓存中刷新。

import re
import time

def setup(N=1000):
    # Patterns 'a.*a', 'a.*b', ..., 'z.*z'
    patterns = [chr(i) + '.*' + chr(j)
                    for i in range(ord('a'), ord('z') + 1)
                    for j in range(ord('a'), ord('z') + 1)]
    # If this assertion below fails, just add more (distinct) patterns.
    # assert(re._MAXCACHE < len(patterns))
    # N strings. Increase N for larger effect.
    strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
    return (patterns, strings)

def without_compile():
    print('Without re.compile:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for s in strings:
        for pat in patterns:
            count += bool(re.search(pat, s))
    return count

def without_compile_cache_friendly():
    print('Without re.compile, cache-friendly order:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for pat in patterns:
        for s in strings:
            count += bool(re.search(pat, s))
    return count

def with_compile():
    print('With re.compile:')
    patterns, strings = setup()
    print('compiling')
    compiled = [re.compile(pattern) for pattern in patterns]
    print('searching')
    count = 0
    for s in strings:
        for regex in compiled:
            count += bool(regex.search(s))
    return count

start = time.time()
print(with_compile())
d1 = time.time() - start
print(f'-- That took {d1:.2f} seconds.\n')

start = time.time()
print(without_compile_cache_friendly())
d2 = time.time() - start
print(f'-- That took {d2:.2f} seconds.\n')

start = time.time()
print(without_compile())
d3 = time.time() - start
print(f'-- That took {d3:.2f} seconds.\n')

print(f'Ratio: {d3/d1:.2f}')

我在笔记本电脑上获得的示例输出(Python 3.7.7):

With re.compile:
compiling
searching
676000
-- That took 0.33 seconds.

Without re.compile, cache-friendly order:
searching
676000
-- That took 0.67 seconds.

Without re.compile:
searching
676000
-- That took 23.54 seconds.

Ratio: 70.89

I didn't bother with timeit as the difference is so stark, but I get qualitatively similar numbers each time. Note that even without re.compile, using the same regex multiple times and moving on to the next one wasn't so bad (only about 2 times as slow as with re.compile), but in the other order (looping through many regexes), it is significantly worse, as expected. Also, increasing the cache size works too: simply setting re._MAXCACHE = len(patterns) in setup() above (of course I don't recommend doing such things in production as names with underscores are conventionally “private”) drops the ~23 seconds back down to ~0.7 seconds, which also matches our understanding.

使用re.compile()还有一个额外的好处,即使用re.VERBOSE向正则表达式模式添加注释

pattern = '''
hello[ ]world    # Some info on my pattern logic. [ ] to recognize space
'''

re.search(pattern, 'hello world', re.VERBOSE)

虽然这不会影响代码的运行速度,但我喜欢这样做,因为这是我注释习惯的一部分。当我想要修改代码时,我完全不喜欢花时间去记住代码背后的逻辑。