我找到了这个关于正则表达式的教程,虽然我直观地理解了“greedy”,“勉强”和“possession”限定词的作用,但我的理解中似乎有一个严重的漏洞。

具体来说,在下面的例子中:

Enter your regex: .*foo // Greedy qualifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo // Reluctant qualifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // Possessive qualifier
Enter input string to search: xfooxxxxxxfoo
No match found.

解释提到了吃掉整个输入字符串,字母被消耗,匹配器后退,最右边出现的“foo”被反流,等等。

不幸的是,尽管有这些美好的比喻,我还是不明白谁吃什么……您是否知道其他教程(简要地)解释正则表达式引擎如何工作?

或者,如果有人能用不同的措辞解释下一段,那将是非常感激的:

The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f", "o", "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f", "o", "o") have already been consumed [by whom?]. So the matcher slowly backs off [from right-to-left?] one letter at a time until the rightmost occurrence of "foo" has been regurgitated [what does this mean?], at which point the match succeeds and the search ends. The second example, however, is reluctant, so it starts by first consuming [by whom?] "nothing". Because "foo" doesn't appear at the beginning of the string, it's forced to swallow [who swallows?] the first letter (an "x"), which triggers the first match at 0 and 4. Our test harness continues the process until the input string is exhausted. It finds another match at 4 and 13. The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+ [how?], leaving nothing left over to satisfy the "foo" at the end of the expression. Use a possessive quantifier for situations where you want to seize all of something without ever backing off [what does back off mean?]; it will outperform the equivalent greedy quantifier in cases where the match is not immediately found.


我会试试看的。

A greedy quantifier first matches as much as possible. So the .* matches the entire string. Then the matcher tries to match the f following, but there are no characters left. So it "backtracks", making the greedy quantifier match one less character (leaving the "o" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step, making the greedy quantifier match one less character again (leaving the "oo" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now, the matcher finally matches the f in the regex, and the o and the next o are matched too. Success!

A reluctant or "non-greedy" quantifier first matches as little as possible. So the .* matches nothing at first, leaving the entire string unmatched. Then the matcher tries to match the f following, but the unmatched portion of the string starts with "x" so that doesn't work. So the matcher backtracks, making the non-greedy quantifier match one more character (now it matches the "x", leaving "fooxxxxxxfoo" unmatched). Then it tries to match the f, which succeeds, and the o and the next o in the regex match too. Success!

在您的示例中,它将使用字符串的剩余未匹配部分“xxxxxxfoo”重新开始该过程,遵循相同的过程。

所有格量词就像贪婪量词一样,但它不会倒退。所以它以。*开始匹配整个字符串,没有不匹配的。那么它就没有什么可以与正则表达式中的f匹配了。由于所有格量词没有回溯,匹配就失败了。


我以前没有听过确切的“反刍”或“后退”这两个词;取代这些的短语是“回溯”,但“反刍”似乎是一个很好的短语,用来表示“在回溯再次扔掉之前,暂时被接受的内容”。

关于大多数正则表达式引擎需要认识的重要一点是,它们正在回溯:它们将尝试性地接受一个潜在的部分匹配,同时尝试匹配正则表达式的整个内容。如果正则表达式在第一次尝试时不能完全匹配,则正则表达式引擎将在其中一个匹配上回溯。它将尝试以不同的方式匹配*,+,?,交替或{n,m}重复,并再次尝试。(是的,这个过程可能需要很长时间。)

第一个例子使用贪心 量词。*来找到“任何东西”,零 或者更多次,然后是信件 f, o, o。因为量词是 的。*部分 表达式首先吃掉整个输入 字符串。在这一点上,整体 表达不能成功,因为没有 最后三个字母(“f”“o”“o”)有 已经被消耗(被谁消耗?)

最后三个字母f、o和o已经被规则的开头.*部分占用了。然而,正则表达式中的下一个元素f在输入字符串中没有任何剩余。引擎将被迫返回最初的.*匹配,并尝试匹配除最后一个字符以外的所有字符。(这可能是聪明的做法,因为它有三个字面术语,但我不知道这个级别的实现细节。)

这个匹配器 慢慢后退(从右到左?)一次一个字母 直到最右边的出现 “foo”被反刍(这是什么意思?

This means the foo had tentatively been including when matching .*. Because that attempt failed, the regex engine tries accepting one fewer character in .*. If there had been a successful match before the .* in this example, then the engine would probably try shortening the .* match (from right-to-left, as you pointed out, because it is a greedy qualifier), and if it was unable to match the entire inputs, then it might be forced to re-evaluate what it had matched before the .* in my hypothetical example.

如果匹配成功,则 搜索结束。 然而,第二个例子是 不情愿,所以先下手为强 消费(由谁消费?)“没什么”。因为“foo”

最初的虚无被。?*,它将消耗允许正则表达式的其余部分匹配的尽可能短的量。

的开头没有出现 绳子,它被迫吞(谁吞?)了

又是。?*消耗第一个字符,在回溯初始失败后,用最短的可能匹配整个正则表达式。(在这种情况下,正则表达式引擎扩展匹配。*?从左到右,因为*?不情愿的)。

第一个字母(一个“x”),触发 第一场比赛在0和4。我们的测试 Harness继续此过程,直到 输入字符串已耗尽。它 在4点和13点又找到了匹配。 第三个例子没有找到a Match是因为量词是 所有格。在这种情况下,整个 输入字符串由。*+(如何使用?)

A .*+将消耗尽可能多的资源,并且当正则表达式作为一个整体无法找到一个匹配时,将不会返回去寻找新的匹配。因为所有格形式不执行回溯,所以您可能不会看到很多使用.*+的情况,而是使用字符类或类似的限制:account: [[:digit:]]*+ phone: [[:digit:]]*+。

This can drastically speed up regex matching, because you're telling the regex engine that it should never backtrack over potential matches if an input doesn't match. (If you had to write all the matching code by hand, this would be similar to never using putc(3) to 'push back' an input character. It would be very similar to the naive code one might write on a first try. Except regex engines are way better than a single character of push-back, they can rewind all the back to zero and try again. :)

但是,除了潜在的加速之外,这还可以让您编写与需要匹配的内容完全匹配的正则表达式。我很难想出一个简单的例子:)但是使用所有格和贪婪量词来编写一个正则表达式可以给你不同的匹配,其中一个可能更合适。

没有留下任何需要满足的东西 的末尾的“foo” 表达式。使用所有格 量词的情况,你 想要抓住所有没有的东西 后退(后退是什么意思?);它的表现会更好

在这里,“back off”意味着“回溯”——放弃一个试探性的部分匹配来尝试另一个部分匹配,这可能会成功,也可能不会成功。

中的等效贪婪量词 不匹配的情况 立即发现。


http://swtch.com/~rsc/regexp/regexp1.html

我不确定这是互联网上最好的解释,但它写得相当好,细节也适当,我一直在想它。你可能会想去看看。

如果您想要更高级的(不太详细的解释),对于简单的正则表达式(例如您正在查看的正则表达式),正则表达式引擎通过回溯工作。从本质上讲,它选择(“吃掉”)字符串的一个部分,并尝试根据该部分匹配正则表达式。如果匹配,那就太好了。如果不是,引擎改变它对字符串部分的选择,并尝试与该部分匹配regexp,等等,直到尝试了所有可能的选择。

这个过程是递归地使用的:在尝试将字符串与给定的正则表达式匹配时,引擎将把正则表达式分割成块,并将算法单独应用于每个块。

贪婪量词、不情愿量词和所有格量词之间的区别出现在引擎选择匹配字符串的哪一部分时,以及如果第一次不成功如何修改该选择时。规则如下:

A greedy quantifier tells the engine to start with the entire string (or at least, all of it that hasn't already been matched by previous parts of the regular expression) and check whether it matches the regexp. If so, great; the engine can continue with the rest of the regexp. If not, it tries again, but trimming one character (the last one) off the section of the string to be checked. If that doesn't work, it trims off another character, etc. So a greedy quantifier checks possible matches in order from longest to shortest. A reluctant quantifier tells the engine to start with the shortest possible piece of the string. If it matches, the engine can continue; if not, it adds one character to the section of the string being checked and tries that, and so on until it finds a match or the entire string has been used up. So a reluctant quantifier checks possible matches in order from shortest to longest. A possessive quantifier is like a greedy quantifier on the first attempt: it tells the engine to start by checking the entire string. The difference is that if it doesn't work, the possessive quantifier reports that the match failed right then and there. The engine doesn't change the section of the string being looked at, and it doesn't make any more attempts.

This is why the possessive quantifier match fails in your example: the .*+ gets checked against the entire string, which it matches, but then the engine goes on to look for additional characters foo after that - but of course it doesn't find them, because you're already at the end of the string. If it were a greedy quantifier, it would backtrack and try making the .* only match up to the next-to-last character, then up to the third to last character, then up to the fourth to last character, which succeeds because only then is there a foo left after the .* has "eaten" the earlier part of the string.


贪婪:“匹配最长的字符序列”

勉强:“匹配最短的字符序列”

所有格:这有点奇怪,因为它并不(与贪婪和不情愿相反)试图为整个正则表达式找到一个匹配。

顺便说一句:任何正则模式匹配器实现都不会使用回溯。所有现实生活中的模式匹配器都非常快——几乎与正则表达式的复杂性无关!


下面是我使用单元格和索引位置(参见这里的图表来区分单元格和索引)。

贪婪-匹配尽可能多的贪婪量词和整个正则表达式。如果没有匹配,则回溯贪婪量词。

输入字符串:xfooxxxxxxfoo Regex: * foo

上面的Regex有两部分: (我)”。*’和 (2)“foo” 下面的每个步骤都将分析这两个部分。匹配“Pass”或“Fail”的附加注释在大括号内解释。

步骤1: (i) .* = xfooxxxxxxfoo - PASS('。*'是一个贪婪的量词,将使用整个输入字符串) (ii) foo =索引13后没有字符要匹配- FAIL 比赛失败了。

步骤2: (i) .* = xfooxxxxxxfo - PASS(贪心量词'.*'的回溯) (ii) foo = o - FAIL 比赛失败了。

步骤3: (i) .* = xfooxxxxxxf - PASS(贪心量词'.*'的回溯) (ii) foo = oo - FAIL 比赛失败了。

步骤4: (i) .* = xfooxxxxxx - PASS(贪心量词'.*'的回溯) (ii) foo = foo - PASS 报告匹配

结果:1个(es) 我找到了文本“xfooxxxxxxfoo”,从索引0开始,到索引13结束。

勉强-尽可能少地匹配勉强量词,并匹配整个正则表达式。如果没有匹配,则为不情愿量词添加字符。

输入字符串:xfooxxxxxxfoo Regex: . * ?喷火

上面的正则表达式有两部分: (我)”。* ?”, (2)“foo”

步骤1: . * ?= "(空白)- PASS(尽可能少匹配不情愿的量词'.*?'。索引0包含"是一个匹配。) foo = xfo - FAIL(单元格0,1,2 -即索引在0和3之间) 比赛失败了。

步骤2: . * ?= x - PASS(为不情愿量词'.*?'添加字符。单元格0有'x'是匹配的。) foo = foo - PASS 报告匹配

步骤3: . * ?= "(空白)- PASS(尽可能少匹配不情愿的量词'.*?'。索引4包含"是一个匹配。) foo = xxx - FAIL(单元格4,5,6 -即索引在4和7之间) 比赛失败了。

步骤4: . * ?= x - PASS(为不情愿量词'.*?'添加字符。细胞4。) foo = xxx - FAIL(单元格5,6,7 -即索引在5和8之间) 比赛失败了。

步骤5: . * ?= xx - PASS(为不情愿量词'.*?'添加字符。4号到5号单元) foo = xxx - FAIL(单元格6,7,8 -即索引在6和9之间) 比赛失败了。

步骤6: . * ?= xxx - PASS(为不情愿量词'.*?'添加字符。4号到6号牢房) foo = xxx - FAIL(单元格7,8,9 -即索引在7和10之间) 比赛失败了。

第七步: . * ?= xxxx - PASS(为不情愿量词'.*?'添加字符。4号到7号牢房) foo = xxf - FAIL(单元格8,9,10 -即索引在8和11之间) 比赛失败了。

第八步: . * ?= xxxxx - PASS(为不情愿量词'.*?'添加字符。4号到8号牢房) foo = xfo - FAIL(单元格9,10,11 -即索引在9和12之间) 比赛失败了。

步骤9: . * ?= xxxxxx - PASS(为不情愿量词'.*?'添加字符。4号至9号牢房) foo = foo - PASS(单元格10,11,12 -即索引在10和13之间) 报告匹配

第十步: . * ?= "(空白)- PASS(尽可能少匹配不情愿的量词'.*?'。索引13为空。) foo =没有字符需要匹配- FAIL(索引13之后没有任何字符需要匹配) 比赛失败了。

结果:2个(es) 我发现文本“xfoo”从索引0开始,在索引4结束。 我找到了文本“xxxxxxfoo”,从索引4开始,到索引13结束。

所有格-尽可能匹配所有格量词并匹配整个正则表达式。不要反悔。

输入字符串:xfooxxxxxxfoo Regex: . * + foo

上面的正则表达式有两部分:'。*+'和'foo'。

步骤1: .*+ = xfooxxxxxxfoo - PASS(尽可能匹配所有格量词'.*') foo =没有剩余字符匹配- FAIL(索引13后没有匹配) 比赛失败了。

注意:不允许回溯。

结果:0次匹配(es)


贪婪量化涉及在迭代期间使用字符串中所有剩余的未验证字符进行模式匹配。未验证的字符从活动序列开始。每次不匹配时,就隔离最后的字符,并再次执行检查。

当活动序列仅满足正则表达式模式的前导条件时,将尝试针对隔离验证其余条件。如果验证成功,则验证隔离区中匹配的字符,剩余的未匹配字符仍未验证,并将在下一次迭代中重新开始该流程时使用。

字符流是从活动序列进入隔离区的。结果的行为是尽可能多地将原始序列包含在匹配中。

不情愿量化与贪婪限定在很大程度上是相同的,除了字符流是相反的——也就是说,它们从隔离开始流入活动序列。结果是匹配中尽可能少地包含原始序列。

所有格量化没有隔离区,包括固定活动序列中的所有内容。


这只是我的实践输出可视化的场景-