我找到了这个关于正则表达式的教程,虽然我直观地理解了“greedy”,“勉强”和“possession”限定词的作用,但我的理解中似乎有一个严重的漏洞。

具体来说,在下面的例子中:

Enter your regex: .*foo // Greedy qualifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo // Reluctant qualifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // Possessive qualifier
Enter input string to search: xfooxxxxxxfoo
No match found.

解释提到了吃掉整个输入字符串,字母被消耗,匹配器后退,最右边出现的“foo”被反流,等等。

不幸的是,尽管有这些美好的比喻,我还是不明白谁吃什么……您是否知道其他教程(简要地)解释正则表达式引擎如何工作?

或者,如果有人能用不同的措辞解释下一段,那将是非常感激的:

The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f", "o", "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f", "o", "o") have already been consumed [by whom?]. So the matcher slowly backs off [from right-to-left?] one letter at a time until the rightmost occurrence of "foo" has been regurgitated [what does this mean?], at which point the match succeeds and the search ends. The second example, however, is reluctant, so it starts by first consuming [by whom?] "nothing". Because "foo" doesn't appear at the beginning of the string, it's forced to swallow [who swallows?] the first letter (an "x"), which triggers the first match at 0 and 4. Our test harness continues the process until the input string is exhausted. It finds another match at 4 and 13. The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+ [how?], leaving nothing left over to satisfy the "foo" at the end of the expression. Use a possessive quantifier for situations where you want to seize all of something without ever backing off [what does back off mean?]; it will outperform the equivalent greedy quantifier in cases where the match is not immediately found.


当前回答

我会试试看的。

A greedy quantifier first matches as much as possible. So the .* matches the entire string. Then the matcher tries to match the f following, but there are no characters left. So it "backtracks", making the greedy quantifier match one less character (leaving the "o" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step, making the greedy quantifier match one less character again (leaving the "oo" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now, the matcher finally matches the f in the regex, and the o and the next o are matched too. Success!

A reluctant or "non-greedy" quantifier first matches as little as possible. So the .* matches nothing at first, leaving the entire string unmatched. Then the matcher tries to match the f following, but the unmatched portion of the string starts with "x" so that doesn't work. So the matcher backtracks, making the non-greedy quantifier match one more character (now it matches the "x", leaving "fooxxxxxxfoo" unmatched). Then it tries to match the f, which succeeds, and the o and the next o in the regex match too. Success!

在您的示例中,它将使用字符串的剩余未匹配部分“xxxxxxfoo”重新开始该过程,遵循相同的过程。

所有格量词就像贪婪量词一样,但它不会倒退。所以它以。*开始匹配整个字符串,没有不匹配的。那么它就没有什么可以与正则表达式中的f匹配了。由于所有格量词没有回溯,匹配就失败了。

其他回答

贪婪:“匹配最长的字符序列”

勉强:“匹配最短的字符序列”

所有格:这有点奇怪,因为它并不(与贪婪和不情愿相反)试图为整个正则表达式找到一个匹配。

顺便说一句:任何正则模式匹配器实现都不会使用回溯。所有现实生活中的模式匹配器都非常快——几乎与正则表达式的复杂性无关!

我以前没有听过确切的“反刍”或“后退”这两个词;取代这些的短语是“回溯”,但“反刍”似乎是一个很好的短语,用来表示“在回溯再次扔掉之前,暂时被接受的内容”。

关于大多数正则表达式引擎需要认识的重要一点是,它们正在回溯:它们将尝试性地接受一个潜在的部分匹配,同时尝试匹配正则表达式的整个内容。如果正则表达式在第一次尝试时不能完全匹配,则正则表达式引擎将在其中一个匹配上回溯。它将尝试以不同的方式匹配*,+,?,交替或{n,m}重复,并再次尝试。(是的,这个过程可能需要很长时间。)

第一个例子使用贪心 量词。*来找到“任何东西”,零 或者更多次,然后是信件 f, o, o。因为量词是 的。*部分 表达式首先吃掉整个输入 字符串。在这一点上,整体 表达不能成功,因为没有 最后三个字母(“f”“o”“o”)有 已经被消耗(被谁消耗?)

最后三个字母f、o和o已经被规则的开头.*部分占用了。然而,正则表达式中的下一个元素f在输入字符串中没有任何剩余。引擎将被迫返回最初的.*匹配,并尝试匹配除最后一个字符以外的所有字符。(这可能是聪明的做法,因为它有三个字面术语,但我不知道这个级别的实现细节。)

这个匹配器 慢慢后退(从右到左?)一次一个字母 直到最右边的出现 “foo”被反刍(这是什么意思?

This means the foo had tentatively been including when matching .*. Because that attempt failed, the regex engine tries accepting one fewer character in .*. If there had been a successful match before the .* in this example, then the engine would probably try shortening the .* match (from right-to-left, as you pointed out, because it is a greedy qualifier), and if it was unable to match the entire inputs, then it might be forced to re-evaluate what it had matched before the .* in my hypothetical example.

如果匹配成功,则 搜索结束。 然而,第二个例子是 不情愿,所以先下手为强 消费(由谁消费?)“没什么”。因为“foo”

最初的虚无被。?*,它将消耗允许正则表达式的其余部分匹配的尽可能短的量。

的开头没有出现 绳子,它被迫吞(谁吞?)了

又是。?*消耗第一个字符,在回溯初始失败后,用最短的可能匹配整个正则表达式。(在这种情况下,正则表达式引擎扩展匹配。*?从左到右,因为*?不情愿的)。

第一个字母(一个“x”),触发 第一场比赛在0和4。我们的测试 Harness继续此过程,直到 输入字符串已耗尽。它 在4点和13点又找到了匹配。 第三个例子没有找到a Match是因为量词是 所有格。在这种情况下,整个 输入字符串由。*+(如何使用?)

A .*+将消耗尽可能多的资源,并且当正则表达式作为一个整体无法找到一个匹配时,将不会返回去寻找新的匹配。因为所有格形式不执行回溯,所以您可能不会看到很多使用.*+的情况,而是使用字符类或类似的限制:account: [[:digit:]]*+ phone: [[:digit:]]*+。

This can drastically speed up regex matching, because you're telling the regex engine that it should never backtrack over potential matches if an input doesn't match. (If you had to write all the matching code by hand, this would be similar to never using putc(3) to 'push back' an input character. It would be very similar to the naive code one might write on a first try. Except regex engines are way better than a single character of push-back, they can rewind all the back to zero and try again. :)

但是,除了潜在的加速之外,这还可以让您编写与需要匹配的内容完全匹配的正则表达式。我很难想出一个简单的例子:)但是使用所有格和贪婪量词来编写一个正则表达式可以给你不同的匹配,其中一个可能更合适。

没有留下任何需要满足的东西 的末尾的“foo” 表达式。使用所有格 量词的情况,你 想要抓住所有没有的东西 后退(后退是什么意思?);它的表现会更好

在这里,“back off”意味着“回溯”——放弃一个试探性的部分匹配来尝试另一个部分匹配,这可能会成功,也可能不会成功。

中的等效贪婪量词 不匹配的情况 立即发现。

我会试试看的。

A greedy quantifier first matches as much as possible. So the .* matches the entire string. Then the matcher tries to match the f following, but there are no characters left. So it "backtracks", making the greedy quantifier match one less character (leaving the "o" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step, making the greedy quantifier match one less character again (leaving the "oo" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now, the matcher finally matches the f in the regex, and the o and the next o are matched too. Success!

A reluctant or "non-greedy" quantifier first matches as little as possible. So the .* matches nothing at first, leaving the entire string unmatched. Then the matcher tries to match the f following, but the unmatched portion of the string starts with "x" so that doesn't work. So the matcher backtracks, making the non-greedy quantifier match one more character (now it matches the "x", leaving "fooxxxxxxfoo" unmatched). Then it tries to match the f, which succeeds, and the o and the next o in the regex match too. Success!

在您的示例中,它将使用字符串的剩余未匹配部分“xxxxxxfoo”重新开始该过程,遵循相同的过程。

所有格量词就像贪婪量词一样,但它不会倒退。所以它以。*开始匹配整个字符串,没有不匹配的。那么它就没有什么可以与正则表达式中的f匹配了。由于所有格量词没有回溯,匹配就失败了。

贪婪量化涉及在迭代期间使用字符串中所有剩余的未验证字符进行模式匹配。未验证的字符从活动序列开始。每次不匹配时,就隔离最后的字符,并再次执行检查。

当活动序列仅满足正则表达式模式的前导条件时,将尝试针对隔离验证其余条件。如果验证成功,则验证隔离区中匹配的字符,剩余的未匹配字符仍未验证,并将在下一次迭代中重新开始该流程时使用。

字符流是从活动序列进入隔离区的。结果的行为是尽可能多地将原始序列包含在匹配中。

不情愿量化与贪婪限定在很大程度上是相同的,除了字符流是相反的——也就是说,它们从隔离开始流入活动序列。结果是匹配中尽可能少地包含原始序列。

所有格量化没有隔离区,包括固定活动序列中的所有内容。

这只是我的实践输出可视化的场景-