Is it worth using Python's re.compile?

Is there any benefit in using compile for regular expressions in Python?

h = re.compile('hello')
h.match('hello world')

re.match('hello', 'hello world')

Current answer

Interestingly, compiling does prove more efficient for me (Python 2.5.2 on Win XP):

import re
import time

rgx = re.compile('(\w+)\s+[0-9_]?\s+\w*')
str = "average    2 never"
a = 0

t = time.time()

for i in xrange(1000000):
    if re.match('(\w+)\s+[0-9_]?\s+\w*', str):
    #~ if rgx.match(str):
        a += 1

print time.time() - t

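Run the code above once as is, then once with the two if lines swapped (comment out the first and uncomment the second). The compiled regex is about twice as fast.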

2009-01-20 18:06:57

Other answers

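I just tried this myself. For the simple case of parsing a number out of a string and summing it, using a compiled regular expression object is about twice as fast as using the re methods.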

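As others have pointed out, the re methods (including re.compile) look up the regular expression string in a cache of previously compiled expressions. Therefore, in the normal case, the extra cost of using the re methods is simply the cost of the cache lookup.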
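The cache lookup described above can be observed with a small Python 3 sketch. It assumes CPython, where a repeated re.compile call for the same pattern and flags is normally served from the module-level cache; the identity check is an implementation detail rather than a documented guarantee, and the name PATTERN is only illustrative.

```python
import re

PATTERN = r'\w+\s+([0-9_]+)\s+\w*'

re.purge()  # clear the module-level regular expression cache
first = re.compile(PATTERN)
second = re.compile(PATTERN)

# On CPython the second call is normally answered from the cache, so both
# names usually refer to the same pattern object (implementation detail).
same_object = first is second

# re.match looks the pattern up (compiling it if needed) on every call and
# then delegates to the pattern object's own match method.
via_module = re.match(PATTERN, "average    2 never")
via_object = first.match("average    2 never")

print(same_object, via_module.group(1), via_object.group(1))
```

Either way the match result is the same; the only difference is whether the lookup happens on every call or once up front.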

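However, on examining the code, the cache is limited to 100 expressions. This raises the question: how painful is it to overflow the cache? The code contains an internal interface to the regular expression compiler, re.sre_compile.compile. If we call it, we bypass the cache. It turns out to be about two orders of magnitude slower for a basic regular expression such as r'\w+\s+([0-9_]+)\s+\w*'.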

Here is my test:

#!/usr/bin/env python
import re
import time

def timed(func):
    def wrapper(*args):
        t = time.time()
        result = func(*args)
        t = time.time() - t
        print '%s took %.3f seconds.' % (func.func_name, t)
        return result
    return wrapper

regularExpression = r'\w+\s+([0-9_]+)\s+\w*'
testString = "average    2 never"

@timed
def noncompiled():
    a = 0
    for x in xrange(1000000):
        m = re.match(regularExpression, testString)
        a += int(m.group(1))
    return a

@timed
def compiled():
    a = 0
    rgx = re.compile(regularExpression)
    for x in xrange(1000000):
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

@timed
def reallyCompiled():
    a = 0
    rgx = re.sre_compile.compile(regularExpression)
    for x in xrange(1000000):
        m = rgx.match(testString)
        a += int(m.group(1))
    return a


@timed
def compiledInLoop():
    a = 0
    for x in xrange(1000000):
        rgx = re.compile(regularExpression)
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

@timed
def reallyCompiledInLoop():
    a = 0
    for x in xrange(10000):
        rgx = re.sre_compile.compile(regularExpression)
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

r1 = noncompiled()
r2 = compiled()
r3 = reallyCompiled()
r4 = compiledInLoop()
r5 = reallyCompiledInLoop()
print "r1 = ", r1
print "r2 = ", r2
print "r3 = ", r3
print "r4 = ", r4
print "r5 = ", r5

And here is the output on my machine:

$ regexTest.py 
noncompiled took 4.555 seconds.
compiled took 2.323 seconds.
reallyCompiled took 2.325 seconds.
compiledInLoop took 4.620 seconds.
reallyCompiledInLoop took 4.074 seconds.
r1 =  2000000
r2 =  2000000
r3 =  2000000
r4 =  2000000
r5 =  20000

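The 'reallyCompiled' functions use the internal interface, which bypasses the cache. Note that the one that compiles on every loop iteration runs only 10,000 iterations, not one million.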
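For readers on Python 3, the cached variants of the test above might be ported roughly as follows. This is a sketch, not the original author's script: it uses timeit instead of the hand-written decorator, a smaller hypothetical loop count (LOOPS), and it leaves out the sre_compile cases because that internal interface is not part of the public API.

```python
import re
import timeit

PATTERN = r'\w+\s+([0-9_]+)\s+\w*'
TEST_STRING = "average    2 never"
LOOPS = 100_000  # hypothetical count, smaller than the original million


def noncompiled():
    total = 0
    for _ in range(LOOPS):
        # Module-level call: cache lookup plus match on every iteration.
        total += int(re.match(PATTERN, TEST_STRING).group(1))
    return total


def compiled():
    rgx = re.compile(PATTERN)  # compiled once, outside the loop
    total = 0
    for _ in range(LOOPS):
        total += int(rgx.match(TEST_STRING).group(1))
    return total


def compiled_in_loop():
    total = 0
    for _ in range(LOOPS):
        rgx = re.compile(PATTERN)  # cache lookup on every iteration
        total += int(rgx.match(TEST_STRING).group(1))
    return total


def time_all():
    """Return a dict mapping each function name to seconds for one run."""
    return {f.__name__: timeit.timeit(f, number=1)
            for f in (noncompiled, compiled, compiled_in_loop)}


if __name__ == "__main__":
    for name, seconds in time_all().items():
        print(f"{name} took {seconds:.3f} seconds.")
```

The absolute numbers will differ by machine and Python version, so treat any ratio you measure as specific to your setup.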

2010-04-14 04:40:24

FWIW:

$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop

$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop

So, if you are going to use the same regex a lot, it may be worth doing re.compile (especially for more complex regexes).

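The standard arguments against premature optimization apply, but I don't think you really lose much clarity or straightforwardness by using re.compile if you suspect that your regexps may become a performance bottleneck.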

Update:

Under Python 3.6 (I suspect the timings above were done with Python 2.x) and 2018 hardware (MacBook Pro), I now get the following timings:

% python -m timeit -s "import re" "re.match('hello', 'hello world')"
1000000 loops, best of 3: 0.661 usec per loop

% python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 0.285 usec per loop

% python -m timeit -s "import re" "h=re.compile('hello'); h.match('hello world')"
1000000 loops, best of 3: 0.65 usec per loop

% python --version
Python 3.6.5 :: Anaconda, Inc.

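I also added a case (note the difference in quoting between the last two runs) which shows that, if I'm not mistaken, re.match(x, ...) is literally [roughly] equivalent to re.compile(x).match(...), i.e. no behind-the-scenes caching of the compiled representation seems to happen.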

2009-01-16 21:42:37

For me, the biggest benefit of re.compile is being able to separate the definition of the regex from its use.

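Even a simple expression such as 0|[1-9][0-9]* (an integer in base 10 without leading zeros) can be complex enough that you would rather not retype it, check it for typos, and later recheck it for typos when you start debugging. Plus, it is nicer to use a variable name such as num or num_b10 than 0|[1-9][0-9]*.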

It is certainly possible to store strings and pass them to re.match; however, that is less readable:

num = "..."
# then, much later:
m = re.match(num, input)

Versus compiling:

num = re.compile("...")
# then, much later:
m = num.match(input)

Though the two are quite close, the last line of the second version feels more natural and simpler when used repeatedly.
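As a slightly fuller illustration of keeping the definition apart from its use, here is a minimal Python 3 sketch built around the integer pattern mentioned above. The names NUM_B10 and is_base10_integer are made up for this example.

```python
import re

# Defined once, near the top of a module, under a descriptive name.
NUM_B10 = re.compile(r'0|[1-9][0-9]*')


def is_base10_integer(text):
    """Return True if text is a base-10 integer without leading zeros."""
    # fullmatch requires the whole string to match, not just a prefix.
    return NUM_B10.fullmatch(text) is not None
```

Call sites then read as is_base10_integer(value) or NUM_B10.match(value), and the pattern itself only has to be proofread in one place.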

2009-01-17 16:49:07

I ran this test before stumbling upon the discussion here. However, having run it, I thought I would at least post my results.

I stole and adapted the example from Jeff Friedl's "Mastering Regular Expressions". This was on a MacBook running OS X 10.6 (2 GHz Intel Core 2 Duo, 4 GB RAM). The Python version is 2.6.1.

Run 1 - using re.compile

import re 
import time 
import fpformat
Regex1 = re.compile('^(a|b|c|d|e|f|g)+$') 
Regex2 = re.compile('^[a-g]+$')
TimesToDo = 1000
TestString = "" 
for i in range(1000):
    TestString += "abababdedfg"
StartTime = time.time() 
for i in range(TimesToDo):
    Regex1.search(TestString) 
Seconds = time.time() - StartTime 
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

StartTime = time.time() 
for i in range(TimesToDo):
    Regex2.search(TestString) 
Seconds = time.time() - StartTime 
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

Alternation takes 2.299 seconds
Character Class takes 0.107 seconds

Run 2 - not using re.compile

import re 
import time 
import fpformat

TimesToDo = 1000
TestString = "" 
for i in range(1000):
    TestString += "abababdedfg"
StartTime = time.time() 
for i in range(TimesToDo):
    re.search('^(a|b|c|d|e|f|g)+$',TestString) 
Seconds = time.time() - StartTime 
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

StartTime = time.time() 
for i in range(TimesToDo):
    re.search('^[a-g]+$',TestString) 
Seconds = time.time() - StartTime 
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

Alternation takes 2.508 seconds
Character Class takes 0.109 seconds

2010-01-17 21:22:18

Using the given example:

h = re.compile('hello')
h.match('hello world')

The match method in the example above is not the same as the one used below:

re.match('hello', 'hello world')

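re.compile() returns a regular expression object, which means h is a regex object.

The regex object has its own match method, with the optional pos and endpos parameters:

regex.match(string[, pos[, endpos]])

pos

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

endpos

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

The regex object's search, findall, and finditer methods also support these parameters.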

re.match(pattern, string, flags=0) does not support them, as you can see, nor do its search, findall, and finditer counterparts.
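The difference between pos/endpos and slicing is easier to see in a short Python 3 sketch. The patterns and the sample string here are chosen only for illustration.

```python
import re

text = "hello world"
anchored = re.compile(r'^world')

# Slicing builds a new string, so '^' matches at the start of the slice.
sliced = anchored.search(text[6:])

# With pos, '^' still refers to the real start of the string, so the
# anchored pattern finds nothing when the search starts at index 6.
with_pos = anchored.search(text, 6)

word = re.compile(r'\w+')

# match() with pos tries to match starting exactly at that index.
from_pos = word.match(text, 6)

# endpos behaves as if the string were only that many characters long.
limited = word.search(text, 0, 3)
truncated = word.search(text[:3], 0)
```

Here sliced and from_pos both find "world", with_pos is None, and limited and truncated both stop after "hel".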

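A match object has attributes that complement these parameters:

match.pos

The value of pos which was passed to the search() or match() method of a regex object. This is the index into the string at which the RE engine started looking for a match.

match.endpos

The value of endpos which was passed to the search() or match() method of a regex object. This is the index into the string beyond which the RE engine will not go.

A regex object has two unique, possibly useful, attributes:

regex.groups

The number of capturing groups in the pattern.

regex.groupindex

A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.

And finally, a match object has this attribute: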

match.re

The regular expression object whose match() or search() method produced this match instance.
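The attributes listed above can be inspected together in a small Python 3 sketch. The pattern, the group names key and value, and the sample string are assumptions made for this example.

```python
import re

rx = re.compile(r'(?P<key>\w+)=(?P<value>\d+)')

# Search only the characters at indices 3 through 10 of the string.
m = rx.search("x: width=42;", 3, 11)

group_count = rx.groups            # number of capturing groups
name_to_number = dict(rx.groupindex)  # group name -> group number

start_index = m.pos                # the pos that was passed to search()
end_index = m.endpos               # the endpos that was passed to search()
produced_by = m.re                 # the pattern object that made this match

print(m.group('key'), m.group('value'), group_count, name_to_number)
```

Because m.re points back at the compiled pattern, code that receives only a match object can still reach rx.groups and rx.groupindex through it.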

2013-03-10 23:03:59
