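I am tired of always trying to guess whether I should escape special characters like '()[]{}|' etc. when using the many implementations of regexps.
It is different in Python, sed, grep, awk, Perl, rename, Apache, find and so on. Is there a rule set that tells me when I should, and when I should not, escape special characters? Does it depend on the regexp type, such as PCRE, POSIX or extended regexps?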
Current answer
Modern RegEx Flavors (PCRE)
Includes C, C++, Delphi, EditPad, Java, JavaScript, Perl, PHP (preg), PostgreSQL, PowerGREP, PowerShell, Python, REALbasic, Real Studio, Ruby, TCL, VB.Net, VBScript, wxWidgets, XML Schema, Xojo, XRegExp. PCRE compatibility may vary.
Escape anywhere: . ^ $ * + - ? ( ) [ ] { } \ |
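As a rough illustration of the list above, here is a minimal Python sketch. It assumes Python's `re` module treats these characters the way the PCRE-style flavors do, and `escape_literal` is a hypothetical helper written for this example, not a library function.

```python
import re

# The characters the list above says to escape in PCRE-style flavors.
METACHARACTERS = ".^$*+-?()[]{}\\|"

def escape_literal(text: str) -> str:
    """Prefix every listed metacharacter with a backslash (hypothetical helper)."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)

sample = "f(x) = [a|b] * {1.5}?"
pattern = escape_literal(sample)

# Escaped, the pattern matches the sample text literally and in full.
assert re.fullmatch(pattern, sample) is not None

# Unescaped, "+" and "?" act as quantifiers, so the text no longer matches itself.
assert re.fullmatch("1+1=2?", "1+1=2?") is None
```

Escaping every listed character is more than strictly needed (for instance `-` only matters inside a character class), but it is harmless in flavors that allow arbitrary symbols to be backslash-escaped.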
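Legacy RegEx Flavors (BRE/ERE)
Includes awk, ed, egrep, emacs, GNUlib, grep, PHP (ereg), MySQL, Oracle, R, sed. PCRE support may be enabled in later versions or by using extensions.
ERE / awk / egrep / emacs
Outside a character class: . ^ $ * + ? ( ) [ { } \ |
Inside a character class: ^ - [ ]
BRE / ed / grep / sed
Outside a character class: . ^ $ * [ \
Inside a character class: ^ - [ ]
For literals, don't escape: + ? ( ) { } |
For standard regex behavior, escape: \+ \? \( \) \{ \} \|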
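The practical difference between the two legacy lists can be sketched with a small Python helper. This is only an illustration built from the character sets given above: `escape_for` is a hypothetical function, it covers text outside a character class only, and it does not call any real BRE or ERE engine.

```python
# Characters that are special when unescaped, outside a character class,
# taken from the ERE and BRE lists above.
SPECIAL = {
    "ERE": set(".^$*+?()[{}\\|"),
    "BRE": set(".^$*[\\"),
}

def escape_for(flavor: str, literal: str) -> str:
    """Backslash-escape `literal` for the given flavor (hypothetical helper)."""
    special = SPECIAL[flavor]  # raises KeyError for an unknown flavor
    return "".join("\\" + ch if ch in special else ch for ch in literal)

# In BRE, + ( ) | are already literal, so nothing changes.
print(escape_for("BRE", "a+b(c)|d"))   # a+b(c)|d
# In ERE the same characters are operators and need a backslash.
print(escape_for("ERE", "a+b(c)|d"))   # a\+b\(c\)\|d
```

The same input therefore needs different escaping depending on whether the tool is run in basic or extended mode.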
Notes
If unsure about a specific character, it can be escaped like \xFF.
Alphanumeric characters cannot be escaped with a backslash.
Arbitrary symbols can be escaped with a backslash in PCRE, but not in BRE/ERE (there they must only be escaped when required).
In PCRE, ] and - only need escaping within a character class, but I kept them in a single list for simplicity.
Quoted expression strings must also have the surrounding quote characters escaped, often with the backslashes doubled up (like "(\")(/)(\\.)" versus /(")(\/)(\.)/ in JavaScript).
Aside from escapes, different regex implementations may support different modifiers, character classes, anchors, quantifiers, and other features. For more details, check out regular-expressions.info, or use regex101.com to test your expressions live.
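The note about quoted strings can be checked in Python, used here only as an example host language. The sketch compares an ordinary string literal, where the quote and the backslash must be escaped for the string parser, with a raw string that passes backslashes through unchanged.

```python
import re

# Ordinary literal: \" and \\ are consumed by the string parser first.
plain = "(\")(/)(\\.)"
# Raw literal: the regex engine sees exactly what is written.
raw = r'(")(/)(\.)'

# Both spellings produce the same pattern text: (")(/)(\.)
assert plain == raw

match = re.search(raw, 'say "/." now')
assert match is not None
assert match.groups() == ('"', '/', '.')

# A hex escape is a fallback when unsure about a character: \x2B is "+".
assert re.fullmatch(r"\x2B", "+") is not None
```

Using raw strings, where the language offers them, removes one layer of backslash doubling.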
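Other answers
Unfortunately there really isn't one fixed set of escape codes, since it varies with the language you are using.
However, keeping a page like the Regular Expression Tools Page or this Regular Expression Cheatsheet at hand can go a long way toward helping you sort things out quickly.
POSIX recognizes multiple variations on regular expressions: basic regular expressions (BRE) and extended regular expressions (ERE). And even then, there are quirks because of the historical implementations of the utilities standardized by POSIX.
There isn't a simple rule for when to use which notation, or even for which notation a given command uses.
Check out Jeff Friedl's book Mastering Regular Expressions.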
To know when and what to escape without trial and error, you need to understand precisely the chain of contexts the string passes through. You specify the string at the farthest end, and it travels to its final destination, which is the memory handled by the regexp parsing code.
Be aware of how the string that ends up in memory is processed. It can be a plain string inside the code, or a string entered on the command line; that command line can be interactive or stated inside a shell script file; it can be a variable in memory mentioned by the code, a (string) argument that goes through further evaluation, or a string containing dynamically generated code with any sort of encapsulation...
Each of these contexts assigns special functions to some characters.
When you want to pass a character literally, without its special function (which is local to that context), you have to escape it for the next context. That next context may need other escape characters, which may in turn need to be escaped in the preceding context(s). Character encoding adds another layer: the most insidious is UTF-8, because it looks like ASCII for common characters but may be interpreted differently, even by the terminal, depending on its settings. The encoding attribute of HTML/XML is a further one. You need to understand the whole process precisely to get it right.
For example, a regexp on a command line starting with perl -npe has to be handed to a set of exec system calls whose file handles are connected by pipes. Each exec call just receives a list of arguments that were separated by (unescaped) spaces, and possibly pipes (|), redirections (> N> N>&M), parentheses, interactive expansion of * and ?, $(()) and so on. All of these are special characters for the *sh shells. They may appear to interfere with the characters of the regular expression in the next context, but the contexts are evaluated in order, and the command line comes first.
The command line is read by a program such as bash/sh/csh/tcsh/zsh. Escaping is simpler inside double or single quotes, but quoting a string on the command line is not strictly necessary: mostly it is the space that has to be prefixed with a backslash, and leaving the quotes out keeps the expansion of * and ? available. Unquoted text is, however, parsed as a different context from quoted text. Once the command line has been evaluated, the regexp obtained in memory (not the one written on the command line) receives the same treatment as it would in a source file. Within a regexp there is also the character-set context inside square brackets [ ], and a Perl regular expression can be delimited by a large set of non-alphanumeric characters (e.g. m// or m:/better/for/path: ...).
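The chain of contexts can be made concrete with a short Python sketch. It assumes a POSIX-style shell as the outer context and uses `grep -E` with a made-up file name `prices.txt` purely as an example; the command is only built and parsed back, never executed.

```python
import re
import shlex

# Innermost context: what the regex engine must receive
# in order to match the literal text "$5.00 (approx)".
regex = r"\$5\.00 \(approx\)"
assert re.search(regex, "cost: $5.00 (approx)") is not None

# One context out: a non-raw string literal doubles every backslash.
assert "\\$5\\.00 \\(approx\\)" == regex

# Outermost context: a shell command line. shlex.quote protects
# $, \, ( and the space from the shell's own interpretation.
# "prices.txt" is a hypothetical file name.
command = "grep -E " + shlex.quote(regex) + " prices.txt"
print(command)

# Parsing the command line the way a POSIX shell splits words
# gives back the regex exactly as the engine needs it.
argv = shlex.split(command)
assert argv[2] == regex
```

Each layer is escaped only for the context that reads it next, which is why the same backslash can end up written one, two, or more times depending on where the pattern is typed.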
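There are more details about the individual characters in the other answers; those are very specific to the final regexp context. You mention that you work out regexp escaping by trial and error. That is probably because the different contexts have different sets of special characters, which confuses your memory of earlier attempts (the backslash is often the character used in all of these contexts to escape a literal character rather than invoke its function).
For Ionic (TypeScript), you have to escape characters with a double backslash. For example (this is meant to match some special characters):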
"^(?=.*[\\]\\[!¡\'=ªº\\-\\_ç@#$%^&*(),;\\.?\":{}|<>\+\\/])"
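Note the characters ] - _ . / here. They must be written with a double backslash. If you don't do that, you will get a type error in your code.
See https://perldoc.perl.org/perlre.html#Quoting-metacharacters and https://perldoc.perl.org/functions/quotemeta.html
In the official documentation, such characters are called metacharacters. Example of quoting: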
my $regex = quotemeta($string);  # backslash-escape every non-word character in $string
s/$regex/something/;             # $string is now matched literally
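For comparison, Python offers `re.escape`, which plays a role similar to Perl's quotemeta. The sketch below wraps it in a hypothetical helper, `replace_literal`, that substitutes a literal string without either argument being interpreted as regex syntax.

```python
import re

def replace_literal(text: str, literal: str, replacement: str) -> str:
    """Replace every occurrence of `literal` in `text` (hypothetical helper)."""
    pattern = re.escape(literal)  # neutralise all regex metacharacters
    # A function replacement avoids backslash processing in `replacement`.
    return re.sub(pattern, lambda _match: replacement, text)

# The "." in "a.b" matches only a real dot, so "axb" is left alone.
print(replace_literal("a.b axb", "a.b", "X"))  # X axb
```

For plain literal replacement `str.replace` would do; escaping is mainly useful when the literal is one part of a larger pattern.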