我找到了这个关于正则表达式的教程,虽然我直观地理解了“greedy”,“勉强”和“possession”限定词的作用,但我的理解中似乎有一个严重的漏洞。

具体来说,在下面的例子中:

Enter your regex: .*foo // Greedy qualifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo // Reluctant qualifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // Possessive qualifier
Enter input string to search: xfooxxxxxxfoo
No match found.

解释提到了吃掉整个输入字符串,字母被消耗,匹配器后退,最右边出现的“foo”被反流,等等。

不幸的是,尽管有这些美好的比喻,我还是不明白谁吃什么……您是否知道其他教程(简要地)解释正则表达式引擎如何工作?

或者,如果有人能用不同的措辞解释下一段,那将是非常感激的:

The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f", "o", "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f", "o", "o") have already been consumed [by whom?]. So the matcher slowly backs off [from right-to-left?] one letter at a time until the rightmost occurrence of "foo" has been regurgitated [what does this mean?], at which point the match succeeds and the search ends. The second example, however, is reluctant, so it starts by first consuming [by whom?] "nothing". Because "foo" doesn't appear at the beginning of the string, it's forced to swallow [who swallows?] the first letter (an "x"), which triggers the first match at 0 and 4. Our test harness continues the process until the input string is exhausted. It finds another match at 4 and 13. The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+ [how?], leaving nothing left over to satisfy the "foo" at the end of the expression. Use a possessive quantifier for situations where you want to seize all of something without ever backing off [what does back off mean?]; it will outperform the equivalent greedy quantifier in cases where the match is not immediately found.


当前回答

我以前没有听过确切的“反刍”或“后退”这两个词;取代这些的短语是“回溯”,但“反刍”似乎是一个很好的短语,用来表示“在回溯再次扔掉之前,暂时被接受的内容”。

关于大多数正则表达式引擎需要认识的重要一点是,它们正在回溯:它们将尝试性地接受一个潜在的部分匹配,同时尝试匹配正则表达式的整个内容。如果正则表达式在第一次尝试时不能完全匹配,则正则表达式引擎将在其中一个匹配上回溯。它将尝试以不同的方式匹配*,+,?,交替或{n,m}重复,并再次尝试。(是的,这个过程可能需要很长时间。)

第一个例子使用贪心 量词。*来找到“任何东西”,零 或者更多次,然后是信件 f, o, o。因为量词是 的。*部分 表达式首先吃掉整个输入 字符串。在这一点上,整体 表达不能成功,因为没有 最后三个字母(“f”“o”“o”)有 已经被消耗(被谁消耗?)

最后三个字母f、o和o已经被规则的开头.*部分占用了。然而,正则表达式中的下一个元素f在输入字符串中没有任何剩余。引擎将被迫返回最初的.*匹配,并尝试匹配除最后一个字符以外的所有字符。(这可能是聪明的做法,因为它有三个字面术语,但我不知道这个级别的实现细节。)

这个匹配器 慢慢后退(从右到左?)一次一个字母 直到最右边的出现 “foo”被反刍(这是什么意思?

This means the foo had tentatively been including when matching .*. Because that attempt failed, the regex engine tries accepting one fewer character in .*. If there had been a successful match before the .* in this example, then the engine would probably try shortening the .* match (from right-to-left, as you pointed out, because it is a greedy qualifier), and if it was unable to match the entire inputs, then it might be forced to re-evaluate what it had matched before the .* in my hypothetical example.

如果匹配成功,则 搜索结束。 然而,第二个例子是 不情愿,所以先下手为强 消费(由谁消费?)“没什么”。因为“foo”

最初的虚无被。?*,它将消耗允许正则表达式的其余部分匹配的尽可能短的量。

的开头没有出现 绳子,它被迫吞(谁吞?)了

又是。?*消耗第一个字符,在回溯初始失败后,用最短的可能匹配整个正则表达式。(在这种情况下,正则表达式引擎扩展匹配。*?从左到右,因为*?不情愿的)。

第一个字母(一个“x”),触发 第一场比赛在0和4。我们的测试 Harness继续此过程,直到 输入字符串已耗尽。它 在4点和13点又找到了匹配。 第三个例子没有找到a Match是因为量词是 所有格。在这种情况下,整个 输入字符串由。*+(如何使用?)

A .*+将消耗尽可能多的资源,并且当正则表达式作为一个整体无法找到一个匹配时,将不会返回去寻找新的匹配。因为所有格形式不执行回溯,所以您可能不会看到很多使用.*+的情况,而是使用字符类或类似的限制:account: [[:digit:]]*+ phone: [[:digit:]]*+。

This can drastically speed up regex matching, because you're telling the regex engine that it should never backtrack over potential matches if an input doesn't match. (If you had to write all the matching code by hand, this would be similar to never using putc(3) to 'push back' an input character. It would be very similar to the naive code one might write on a first try. Except regex engines are way better than a single character of push-back, they can rewind all the back to zero and try again. :)

但是,除了潜在的加速之外,这还可以让您编写与需要匹配的内容完全匹配的正则表达式。我很难想出一个简单的例子:)但是使用所有格和贪婪量词来编写一个正则表达式可以给你不同的匹配,其中一个可能更合适。

没有留下任何需要满足的东西 的末尾的“foo” 表达式。使用所有格 量词的情况,你 想要抓住所有没有的东西 后退(后退是什么意思?);它的表现会更好

在这里,“back off”意味着“回溯”——放弃一个试探性的部分匹配来尝试另一个部分匹配,这可能会成功,也可能不会成功。

中的等效贪婪量词 不匹配的情况 立即发现。

其他回答

http://swtch.com/~rsc/regexp/regexp1.html

我不确定这是互联网上最好的解释,但它写得相当好,细节也适当,我一直在想它。你可能会想去看看。

如果您想要更高级的(不太详细的解释),对于简单的正则表达式(例如您正在查看的正则表达式),正则表达式引擎通过回溯工作。从本质上讲,它选择(“吃掉”)字符串的一个部分,并尝试根据该部分匹配正则表达式。如果匹配,那就太好了。如果不是,引擎改变它对字符串部分的选择,并尝试与该部分匹配regexp,等等,直到尝试了所有可能的选择。

这个过程是递归地使用的:在尝试将字符串与给定的正则表达式匹配时,引擎将把正则表达式分割成块,并将算法单独应用于每个块。

贪婪量词、不情愿量词和所有格量词之间的区别出现在引擎选择匹配字符串的哪一部分时,以及如果第一次不成功如何修改该选择时。规则如下:

A greedy quantifier tells the engine to start with the entire string (or at least, all of it that hasn't already been matched by previous parts of the regular expression) and check whether it matches the regexp. If so, great; the engine can continue with the rest of the regexp. If not, it tries again, but trimming one character (the last one) off the section of the string to be checked. If that doesn't work, it trims off another character, etc. So a greedy quantifier checks possible matches in order from longest to shortest. A reluctant quantifier tells the engine to start with the shortest possible piece of the string. If it matches, the engine can continue; if not, it adds one character to the section of the string being checked and tries that, and so on until it finds a match or the entire string has been used up. So a reluctant quantifier checks possible matches in order from shortest to longest. A possessive quantifier is like a greedy quantifier on the first attempt: it tells the engine to start by checking the entire string. The difference is that if it doesn't work, the possessive quantifier reports that the match failed right then and there. The engine doesn't change the section of the string being looked at, and it doesn't make any more attempts.

This is why the possessive quantifier match fails in your example: the .*+ gets checked against the entire string, which it matches, but then the engine goes on to look for additional characters foo after that - but of course it doesn't find them, because you're already at the end of the string. If it were a greedy quantifier, it would backtrack and try making the .* only match up to the next-to-last character, then up to the third to last character, then up to the fourth to last character, which succeeds because only then is there a foo left after the .* has "eaten" the earlier part of the string.

贪婪量化涉及在迭代期间使用字符串中所有剩余的未验证字符进行模式匹配。未验证的字符从活动序列开始。每次不匹配时,就隔离最后的字符,并再次执行检查。

当活动序列仅满足正则表达式模式的前导条件时,将尝试针对隔离验证其余条件。如果验证成功,则验证隔离区中匹配的字符,剩余的未匹配字符仍未验证,并将在下一次迭代中重新开始该流程时使用。

字符流是从活动序列进入隔离区的。结果的行为是尽可能多地将原始序列包含在匹配中。

不情愿量化与贪婪限定在很大程度上是相同的,除了字符流是相反的——也就是说,它们从隔离开始流入活动序列。结果是匹配中尽可能少地包含原始序列。

所有格量化没有隔离区,包括固定活动序列中的所有内容。

我以前没有听过确切的“反刍”或“后退”这两个词;取代这些的短语是“回溯”,但“反刍”似乎是一个很好的短语,用来表示“在回溯再次扔掉之前,暂时被接受的内容”。

关于大多数正则表达式引擎需要认识的重要一点是,它们正在回溯:它们将尝试性地接受一个潜在的部分匹配,同时尝试匹配正则表达式的整个内容。如果正则表达式在第一次尝试时不能完全匹配,则正则表达式引擎将在其中一个匹配上回溯。它将尝试以不同的方式匹配*,+,?,交替或{n,m}重复,并再次尝试。(是的,这个过程可能需要很长时间。)

第一个例子使用贪心 量词。*来找到“任何东西”,零 或者更多次,然后是信件 f, o, o。因为量词是 的。*部分 表达式首先吃掉整个输入 字符串。在这一点上,整体 表达不能成功,因为没有 最后三个字母(“f”“o”“o”)有 已经被消耗(被谁消耗?)

最后三个字母f、o和o已经被规则的开头.*部分占用了。然而,正则表达式中的下一个元素f在输入字符串中没有任何剩余。引擎将被迫返回最初的.*匹配,并尝试匹配除最后一个字符以外的所有字符。(这可能是聪明的做法,因为它有三个字面术语,但我不知道这个级别的实现细节。)

这个匹配器 慢慢后退(从右到左?)一次一个字母 直到最右边的出现 “foo”被反刍(这是什么意思?

This means the foo had tentatively been including when matching .*. Because that attempt failed, the regex engine tries accepting one fewer character in .*. If there had been a successful match before the .* in this example, then the engine would probably try shortening the .* match (from right-to-left, as you pointed out, because it is a greedy qualifier), and if it was unable to match the entire inputs, then it might be forced to re-evaluate what it had matched before the .* in my hypothetical example.

如果匹配成功,则 搜索结束。 然而,第二个例子是 不情愿,所以先下手为强 消费(由谁消费?)“没什么”。因为“foo”

最初的虚无被。?*,它将消耗允许正则表达式的其余部分匹配的尽可能短的量。

的开头没有出现 绳子,它被迫吞(谁吞?)了

又是。?*消耗第一个字符,在回溯初始失败后,用最短的可能匹配整个正则表达式。(在这种情况下,正则表达式引擎扩展匹配。*?从左到右,因为*?不情愿的)。

第一个字母(一个“x”),触发 第一场比赛在0和4。我们的测试 Harness继续此过程,直到 输入字符串已耗尽。它 在4点和13点又找到了匹配。 第三个例子没有找到a Match是因为量词是 所有格。在这种情况下,整个 输入字符串由。*+(如何使用?)

A .*+将消耗尽可能多的资源,并且当正则表达式作为一个整体无法找到一个匹配时,将不会返回去寻找新的匹配。因为所有格形式不执行回溯,所以您可能不会看到很多使用.*+的情况,而是使用字符类或类似的限制:account: [[:digit:]]*+ phone: [[:digit:]]*+。

This can drastically speed up regex matching, because you're telling the regex engine that it should never backtrack over potential matches if an input doesn't match. (If you had to write all the matching code by hand, this would be similar to never using putc(3) to 'push back' an input character. It would be very similar to the naive code one might write on a first try. Except regex engines are way better than a single character of push-back, they can rewind all the back to zero and try again. :)

但是,除了潜在的加速之外,这还可以让您编写与需要匹配的内容完全匹配的正则表达式。我很难想出一个简单的例子:)但是使用所有格和贪婪量词来编写一个正则表达式可以给你不同的匹配,其中一个可能更合适。

没有留下任何需要满足的东西 的末尾的“foo” 表达式。使用所有格 量词的情况,你 想要抓住所有没有的东西 后退(后退是什么意思?);它的表现会更好

在这里,“back off”意味着“回溯”——放弃一个试探性的部分匹配来尝试另一个部分匹配,这可能会成功,也可能不会成功。

中的等效贪婪量词 不匹配的情况 立即发现。

这只是我的实践输出可视化的场景-

贪婪:“匹配最长的字符序列”

勉强:“匹配最短的字符序列”

所有格:这有点奇怪,因为它并不(与贪婪和不情愿相反)试图为整个正则表达式找到一个匹配。

顺便说一句:任何正则模式匹配器实现都不会使用回溯。所有现实生活中的模式匹配器都非常快——几乎与正则表达式的复杂性无关!