在Python中对正则表达式使用compile有什么好处吗?

h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')

当前回答

Ubuntu 22.04:

$ python --version
Python 3.10.6

$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 're.match("[0-9]{3}-[0-9]{3}-[0-9]{4}", "123-123-1234")'; done
1 loop, best of 5: 972 nsec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (186 usec) was more than four times slower than the best time (972 nsec).
10 loops, best of 5: 819 nsec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (13.9 usec) was more than four times slower than the best time (819 nsec).
100 loops, best of 5: 763 nsec per loop
1000 loops, best of 5: 699 nsec per loop
10000 loops, best of 5: 653 nsec per loop
100000 loops, best of 5: 655 nsec per loop
1000000 loops, best of 5: 656 nsec per loop

$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 'r = re.compile("[0-9]{3}-[0-9]{3}-[0-9]{4}")' 'r.match("123-123-1234")'; done
1 loop, best of 5: 985 nsec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (134 usec) was more than four times slower than the best time (985 nsec).
10 loops, best of 5: 775 nsec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (13.9 usec) was more than four times slower than the best time (775 nsec).
100 loops, best of 5: 756 nsec per loop
1000 loops, best of 5: 701 nsec per loop
10000 loops, best of 5: 704 nsec per loop
100000 loops, best of 5: 654 nsec per loop
1000000 loops, best of 5: 651 nsec per loop

其他回答

使用第二个版本时,正则表达式在使用之前会进行编译。如果你要多次执行它,最好先编译它。如果不是每次编译都匹配一次性的是好的。

除了表演。

使用compile帮助我区分的概念 1. 模块(re), 2. 正则表达式对象 3.匹配对象 当我开始学习正则表达式的时候

#regex object
regex_object = re.compile(r'[a-zA-Z]+')
#match object
match_object = regex_object.search('1.Hello')
#matching content
match_object.group()
output:
Out[60]: 'Hello'
V.S.
re.search(r'[a-zA-Z]+','1.Hello').group()
Out[61]: 'Hello'

作为补充,我做了一个详尽的备忘单模块re供您参考。

regex = {
'brackets':{'single_character': ['[]', '.', {'negate':'^'}],
            'capturing_group' : ['()','(?:)', '(?!)' '|', '\\', 'backreferences and named group'],
            'repetition'      : ['{}', '*?', '+?', '??', 'greedy v.s. lazy ?']},
'lookaround' :{'lookahead'  : ['(?=...)', '(?!...)'],
            'lookbehind' : ['(?<=...)','(?<!...)'],
            'caputuring' : ['(?P<name>...)', '(?P=name)', '(?:)'],},
'escapes':{'anchor'          : ['^', '\b', '$'],
          'non_printable'   : ['\n', '\t', '\r', '\f', '\v'],
          'shorthand'       : ['\d', '\w', '\s']},
'methods': {['search', 'match', 'findall', 'finditer'],
              ['split', 'sub']},
'match_object': ['group','groups', 'groupdict','start', 'end', 'span',]
}

我的理解是,这两个例子实际上是等价的。唯一的区别是,在第一种情况下,您可以在其他地方重用已编译的正则表达式,而不会导致再次编译它。

这里有一个参考:http://diveintopython3.ep.io/refactoring.html

使用字符串'M'调用已编译模式对象的搜索函数,其效果与同时使用正则表达式和字符串'M'调用re.search相同。只是要快得多。(事实上,re.search函数只是编译正则表达式,并为您调用结果模式对象的搜索方法。)

我想说的是,预编译在概念上和“字面上”(如在“文学编程”中)都是有利的。看看这段代码片段:

from re import compile as _Re

class TYPO:

  def text_has_foobar( self, text ):
    return self._text_has_foobar_re_search( text ) is not None
  _text_has_foobar_re_search = _Re( r"""(?i)foobar""" ).search

TYPO = TYPO()

在你的应用程序中,你可以这样写:

from TYPO import TYPO
print( TYPO.text_has_foobar( 'FOObar ) )

this is about as simple in terms of functionality as it can get. because this is example is so short, i conflated the way to get _text_has_foobar_re_search all in one line. the disadvantage of this code is that it occupies a little memory for whatever the lifetime of the TYPO library object is; the advantage is that when doing a foobar search, you'll get away with two function calls and two class dictionary lookups. how many regexes are cached by re and the overhead of that cache are irrelevant here.

将其与更常见的风格进行比较,如下所示:

import re

class Typo:

  def text_has_foobar( self, text ):
    return re.compile( r"""(?i)foobar""" ).search( text ) is not None

在应用中:

typo = Typo()
print( typo.text_has_foobar( 'FOObar ) )

我很乐意承认我的风格在python中是非常不寻常的,甚至可能是有争议的。然而,在更接近python的使用方式的示例中,为了进行一次匹配,我们必须实例化一个对象,进行三次实例字典查找,并执行三次函数调用;此外,当使用超过100个正则表达式时,我们可能会遇到重新缓存的麻烦。此外,正则表达式被隐藏在方法体中,这在大多数情况下并不是一个好主意。

可以说,每一个措施的子集——有针对性的,别名的import语句;别名方法(如适用);减少函数调用和对象字典查找——可以帮助减少计算和概念的复杂性。

我真的很尊重上面所有的答案。在我看来 是的!当然,使用re.compile而不是一次又一次地编译正则表达式是值得的。

使用re.compile可以使代码更加动态,因为您可以调用已经编译好的正则表达式,而不是一次又一次地编译。这对你有好处:

处理器的努力 时间复杂度。 使正则表达式通用。(可以在findall, search, match中使用) 并使您的程序看起来很酷。

例子:

  example_string = "The room number of her room is 26A7B."
  find_alpha_numeric_string = re.compile(r"\b\w+\b")

在Findall中使用

 find_alpha_numeric_string.findall(example_string)

在搜索中使用

  find_alpha_numeric_string.search(example_string)

类似地,您可以将它用于:Match和Substitute