我找到了这个关于正则表达式的教程,虽然我直观地理解了“greedy”,“勉强”和“possession”限定词的作用,但我的理解中似乎有一个严重的漏洞。

具体来说,在下面的例子中:

Enter your regex: .*foo // Greedy qualifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo // Reluctant qualifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // Possessive qualifier
Enter input string to search: xfooxxxxxxfoo
No match found.

解释提到了吃掉整个输入字符串,字母被消耗,匹配器后退,最右边出现的“foo”被反流,等等。

不幸的是,尽管有这些美好的比喻,我还是不明白谁吃什么……您是否知道其他教程(简要地)解释正则表达式引擎如何工作?

或者,如果有人能用不同的措辞解释下一段,那将是非常感激的:

The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f", "o", "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f", "o", "o") have already been consumed [by whom?]. So the matcher slowly backs off [from right-to-left?] one letter at a time until the rightmost occurrence of "foo" has been regurgitated [what does this mean?], at which point the match succeeds and the search ends. The second example, however, is reluctant, so it starts by first consuming [by whom?] "nothing". Because "foo" doesn't appear at the beginning of the string, it's forced to swallow [who swallows?] the first letter (an "x"), which triggers the first match at 0 and 4. Our test harness continues the process until the input string is exhausted. It finds another match at 4 and 13. The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+ [how?], leaving nothing left over to satisfy the "foo" at the end of the expression. Use a possessive quantifier for situations where you want to seize all of something without ever backing off [what does back off mean?]; it will outperform the equivalent greedy quantifier in cases where the match is not immediately found.


当前回答

这只是我的实践输出可视化的场景-

其他回答

贪婪:“匹配最长的字符序列”

勉强:“匹配最短的字符序列”

所有格:这有点奇怪,因为它并不(与贪婪和不情愿相反)试图为整个正则表达式找到一个匹配。

顺便说一句:任何正则模式匹配器实现都不会使用回溯。所有现实生活中的模式匹配器都非常快——几乎与正则表达式的复杂性无关!

http://swtch.com/~rsc/regexp/regexp1.html

我不确定这是互联网上最好的解释,但它写得相当好,细节也适当,我一直在想它。你可能会想去看看。

如果您想要更高级的(不太详细的解释),对于简单的正则表达式(例如您正在查看的正则表达式),正则表达式引擎通过回溯工作。从本质上讲,它选择(“吃掉”)字符串的一个部分,并尝试根据该部分匹配正则表达式。如果匹配,那就太好了。如果不是,引擎改变它对字符串部分的选择,并尝试与该部分匹配regexp,等等,直到尝试了所有可能的选择。

这个过程是递归地使用的:在尝试将字符串与给定的正则表达式匹配时,引擎将把正则表达式分割成块,并将算法单独应用于每个块。

贪婪量词、不情愿量词和所有格量词之间的区别出现在引擎选择匹配字符串的哪一部分时,以及如果第一次不成功如何修改该选择时。规则如下:

A greedy quantifier tells the engine to start with the entire string (or at least, all of it that hasn't already been matched by previous parts of the regular expression) and check whether it matches the regexp. If so, great; the engine can continue with the rest of the regexp. If not, it tries again, but trimming one character (the last one) off the section of the string to be checked. If that doesn't work, it trims off another character, etc. So a greedy quantifier checks possible matches in order from longest to shortest. A reluctant quantifier tells the engine to start with the shortest possible piece of the string. If it matches, the engine can continue; if not, it adds one character to the section of the string being checked and tries that, and so on until it finds a match or the entire string has been used up. So a reluctant quantifier checks possible matches in order from shortest to longest. A possessive quantifier is like a greedy quantifier on the first attempt: it tells the engine to start by checking the entire string. The difference is that if it doesn't work, the possessive quantifier reports that the match failed right then and there. The engine doesn't change the section of the string being looked at, and it doesn't make any more attempts.

This is why the possessive quantifier match fails in your example: the .*+ gets checked against the entire string, which it matches, but then the engine goes on to look for additional characters foo after that - but of course it doesn't find them, because you're already at the end of the string. If it were a greedy quantifier, it would backtrack and try making the .* only match up to the next-to-last character, then up to the third to last character, then up to the fourth to last character, which succeeds because only then is there a foo left after the .* has "eaten" the earlier part of the string.

贪婪量化涉及在迭代期间使用字符串中所有剩余的未验证字符进行模式匹配。未验证的字符从活动序列开始。每次不匹配时,就隔离最后的字符,并再次执行检查。

当活动序列仅满足正则表达式模式的前导条件时,将尝试针对隔离验证其余条件。如果验证成功,则验证隔离区中匹配的字符,剩余的未匹配字符仍未验证,并将在下一次迭代中重新开始该流程时使用。

字符流是从活动序列进入隔离区的。结果的行为是尽可能多地将原始序列包含在匹配中。

不情愿量化与贪婪限定在很大程度上是相同的,除了字符流是相反的——也就是说,它们从隔离开始流入活动序列。结果是匹配中尽可能少地包含原始序列。

所有格量化没有隔离区,包括固定活动序列中的所有内容。

这只是我的实践输出可视化的场景-

下面是我使用单元格和索引位置(参见这里的图表来区分单元格和索引)。

贪婪-匹配尽可能多的贪婪量词和整个正则表达式。如果没有匹配,则回溯贪婪量词。

输入字符串:xfooxxxxxxfoo Regex: * foo

上面的Regex有两部分: (我)”。*’和 (2)“foo” 下面的每个步骤都将分析这两个部分。匹配“Pass”或“Fail”的附加注释在大括号内解释。

步骤1: (i) .* = xfooxxxxxxfoo - PASS('。*'是一个贪婪的量词,将使用整个输入字符串) (ii) foo =索引13后没有字符要匹配- FAIL 比赛失败了。

步骤2: (i) .* = xfooxxxxxxfo - PASS(贪心量词'.*'的回溯) (ii) foo = o - FAIL 比赛失败了。

步骤3: (i) .* = xfooxxxxxxf - PASS(贪心量词'.*'的回溯) (ii) foo = oo - FAIL 比赛失败了。

步骤4: (i) .* = xfooxxxxxx - PASS(贪心量词'.*'的回溯) (ii) foo = foo - PASS 报告匹配

结果:1个(es) 我找到了文本“xfooxxxxxxfoo”,从索引0开始,到索引13结束。

勉强-尽可能少地匹配勉强量词,并匹配整个正则表达式。如果没有匹配,则为不情愿量词添加字符。

输入字符串:xfooxxxxxxfoo Regex: . * ?喷火

上面的正则表达式有两部分: (我)”。* ?”, (2)“foo”

步骤1: . * ?= "(空白)- PASS(尽可能少匹配不情愿的量词'.*?'。索引0包含"是一个匹配。) foo = xfo - FAIL(单元格0,1,2 -即索引在0和3之间) 比赛失败了。

步骤2: . * ?= x - PASS(为不情愿量词'.*?'添加字符。单元格0有'x'是匹配的。) foo = foo - PASS 报告匹配

步骤3: . * ?= "(空白)- PASS(尽可能少匹配不情愿的量词'.*?'。索引4包含"是一个匹配。) foo = xxx - FAIL(单元格4,5,6 -即索引在4和7之间) 比赛失败了。

步骤4: . * ?= x - PASS(为不情愿量词'.*?'添加字符。细胞4。) foo = xxx - FAIL(单元格5,6,7 -即索引在5和8之间) 比赛失败了。

步骤5: . * ?= xx - PASS(为不情愿量词'.*?'添加字符。4号到5号单元) foo = xxx - FAIL(单元格6,7,8 -即索引在6和9之间) 比赛失败了。

步骤6: . * ?= xxx - PASS(为不情愿量词'.*?'添加字符。4号到6号牢房) foo = xxx - FAIL(单元格7,8,9 -即索引在7和10之间) 比赛失败了。

第七步: . * ?= xxxx - PASS(为不情愿量词'.*?'添加字符。4号到7号牢房) foo = xxf - FAIL(单元格8,9,10 -即索引在8和11之间) 比赛失败了。

第八步: . * ?= xxxxx - PASS(为不情愿量词'.*?'添加字符。4号到8号牢房) foo = xfo - FAIL(单元格9,10,11 -即索引在9和12之间) 比赛失败了。

步骤9: . * ?= xxxxxx - PASS(为不情愿量词'.*?'添加字符。4号至9号牢房) foo = foo - PASS(单元格10,11,12 -即索引在10和13之间) 报告匹配

第十步: . * ?= "(空白)- PASS(尽可能少匹配不情愿的量词'.*?'。索引13为空。) foo =没有字符需要匹配- FAIL(索引13之后没有任何字符需要匹配) 比赛失败了。

结果:2个(es) 我发现文本“xfoo”从索引0开始,在索引4结束。 我找到了文本“xxxxxxfoo”,从索引4开始,到索引13结束。

所有格-尽可能匹配所有格量词并匹配整个正则表达式。不要反悔。

输入字符串:xfooxxxxxxfoo Regex: . * + foo

上面的正则表达式有两部分:'。*+'和'foo'。

步骤1: .*+ = xfooxxxxxxfoo - PASS(尽可能匹配所有格量词'.*') foo =没有剩余字符匹配- FAIL(索引13后没有匹配) 比赛失败了。

注意:不允许回溯。

结果:0次匹配(es)