值得使用Python的re.compile吗?

在Python中对正则表达式使用compile有什么好处吗?

h = re.compile('hello')
h.match('hello world')

re.match('hello', 'hello world')

当前回答

我想说的是，预编译在概念上和“字面上”(如在“文学编程”中)都是有利的。看看这段代码片段:

from re import compile as _Re

class TYPO:

  def text_has_foobar( self, text ):
    return self._text_has_foobar_re_search( text ) is not None
  _text_has_foobar_re_search = _Re( r"""(?i)foobar""" ).search

TYPO = TYPO()

在你的应用程序中，你可以这样写:

from TYPO import TYPO
print( TYPO.text_has_foobar( 'FOObar ) )

this is about as simple in terms of functionality as it can get. because this is example is so short, i conflated the way to get _text_has_foobar_re_search all in one line. the disadvantage of this code is that it occupies a little memory for whatever the lifetime of the TYPO library object is; the advantage is that when doing a foobar search, you'll get away with two function calls and two class dictionary lookups. how many regexes are cached by re and the overhead of that cache are irrelevant here.

将其与更常见的风格进行比较，如下所示:

import re

class Typo:

  def text_has_foobar( self, text ):
    return re.compile( r"""(?i)foobar""" ).search( text ) is not None

在应用中:

typo = Typo()
print( typo.text_has_foobar( 'FOObar ) )

我很乐意承认我的风格在python中是非常不寻常的，甚至可能是有争议的。然而，在更接近python的使用方式的示例中，为了进行一次匹配，我们必须实例化一个对象，进行三次实例字典查找，并执行三次函数调用;此外，当使用超过100个正则表达式时，我们可能会遇到重新缓存的麻烦。此外，正则表达式被隐藏在方法体中，这在大多数情况下并不是一个好主意。

可以说，每一个措施的子集——有针对性的，别名的import语句;别名方法(如适用);减少函数调用和对象字典查找——可以帮助减少计算和概念的复杂性。

2010-11-06 20:03:49

其他回答

大多数情况下，是否使用re.compile没有什么区别。在内部，所有函数都是按照编译步骤实现的:

def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def fullmatch(pattern, string, flags=0):
    return _compile(pattern, flags).fullmatch(string)

def search(pattern, string, flags=0):
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    return _compile(pattern, flags).findall(string)

def finditer(pattern, string, flags=0):
    return _compile(pattern, flags).finditer(string)

此外，re.compile()绕过了额外的间接和缓存逻辑:

_cache = {}

_pattern_type = type(sre_compile.compile("", 0))

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        if p.flags & LOCALE:
            if not _locale:
                return p
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
    return p

除了使用re.compile带来的小速度好处外，人们还喜欢命名潜在复杂的模式规范并将其与应用的业务逻辑分离所带来的可读性:

#### Patterns ############################################################
number_pattern = re.compile(r'\d+(\.\d*)?')    # Integer or decimal number
assign_pattern = re.compile(r':=')             # Assignment operator
identifier_pattern = re.compile(r'[A-Za-z]+')  # Identifiers
whitespace_pattern = re.compile(r'[\t ]+')     # Spaces and tabs

#### Applications ########################################################

if whitespace_pattern.match(s): business_logic_rule_1()
if assign_pattern.match(s): business_logic_rule_2()

注意，另一位受访者错误地认为pyc文件直接存储已编译的模式;然而，在现实中，每次PYC加载时，它们都会被重新构建:

>>> from dis import dis
>>> with open('tmp.pyc', 'rb') as f:
        f.read(8)
        dis(marshal.load(f))

  1           0 LOAD_CONST               0 (-1)
              3 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (re)
              9 STORE_NAME               0 (re)

  3          12 LOAD_NAME                0 (re)
             15 LOAD_ATTR                1 (compile)
             18 LOAD_CONST               2 ('[aeiou]{2,5}')
             21 CALL_FUNCTION            1
             24 STORE_NAME               2 (lc_vowels)
             27 LOAD_CONST               1 (None)
             30 RETURN_VALUE

上面的分解来自于一个包含tmp.py的PYC文件:

import re
lc_vowels = re.compile(r'[aeiou]{2,5}')

2017-02-21 07:16:00

(几个月后)很容易在re.match周围添加自己的缓存，或者其他任何事情——

""" Re.py: Re.match = re.match + cache  
    efficiency: re.py does this already (but what's _MAXCACHE ?)
    readability, inline / separate: matter of taste
"""

import re

cache = {}
_re_type = type( re.compile( "" ))

def match( pattern, str, *opt ):
    """ Re.match = re.match + cache re.compile( pattern ) 
    """
    if type(pattern) == _re_type:
        cpat = pattern
    elif pattern in cache:
        cpat = cache[pattern]
    else:
        cpat = cache[pattern] = re.compile( pattern, *opt )
    return cpat.match( str )

# def search ...

一个wibni，如果:cachehint(size=)， cacheinfo() -> size, hits, nclear…

2009-07-06 10:13:44

尽管这两种方法在速度方面是可以比较的，但是您应该知道，如果您正在处理数百万次迭代，那么仍然存在一些可以忽略不计的时间差。

以下速度测试:

import re
import time

SIZE = 100_000_000

start = time.time()
foo = re.compile('foo')
[foo.search('bar') for _ in range(SIZE)]
print('compiled:  ', time.time() - start)

start = time.time()
[re.search('foo', 'bar') for _ in range(SIZE)]
print('uncompiled:', time.time() - start)

给出了以下结果:

compiled:   14.647532224655151
uncompiled: 61.483458042144775

编译后的方法在我的PC上(使用Python 3.7.0)始终快大约4倍。

如文档中所述:

如果在循环中访问正则表达式，预编译它将节省一些函数调用。在循环之外，由于内部缓存，没有太大区别。

2021-07-16 09:30:40

一般来说，我发现在编译模式时使用标志比内联使用标志更容易(至少更容易记住如何使用)，比如re.I。

>>> foo_pat = re.compile('foo',re.I)
>>> foo_pat.findall('some string FoO bar')
['FoO']

>>> re.findall('(?i)foo','some string FoO bar')
['FoO']

2009-03-18 22:13:45

使用第二个版本时，正则表达式在使用之前会进行编译。如果你要多次执行它，最好先编译它。如果不是每次编译都匹配一次性的是好的。

2009-01-16 21:36:28

值得使用Python的re.compile吗?

推荐文章

最新文章

标签