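I want to use input from a user as a regex pattern for a search over some text. It works, but how can I handle cases where the user enters characters that have a special meaning in a regex?
For example, the user wants to search for Word (s): the regex engine will take the (s) as a group. I want it to be treated as the literal string "(s)". I could run replace on the user input and replace ( with \( and ) with \), but the problem is that I would need to do a replacement for every possible regex symbol.
Do you know a better way?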
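Use the re.escape() function for this:
4.2.3 re Module Contents
escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
A simplistic example: match the provided string, optionally followed by 's', and return the match object. Note that re.match() only matches at the beginning of the text; use re.search() to find the word anywhere in it.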
import re

def simplistic_plural(word, text):
    # Escape the word so its metacharacters match literally; allow a trailing 's'.
    word_or_plural = re.escape(word) + 's?'
    return re.match(word_or_plural, text)
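Applied to the case from the question, a minimal sketch could look like the following. The helper name `find_literal` and the sample text are made up for illustration; only `re.escape()` and `re.search()` come from the standard library.

```python
import re

def find_literal(user_input, text):
    # Hypothetical helper: escape the input so characters such as ( and )
    # lose their regex meaning, then scan the whole text for it.
    return re.search(re.escape(user_input), text)

text = "Please pick the Word(s) you like."

match = find_literal("Word(s)", text)
# Without escaping, (s) is a capture group, so the pattern looks for "Words".
unescaped = re.search("Word(s)", text)

print(match.group(0))  # Word(s)
print(unescaped)       # None
```

The unescaped search fails here because the text contains "Word(s)" but never "Words".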
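You can use re.escape():
re.escape(string) Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.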
>>> import re
>>> re.escape('^a.*$')
'\\^a\\.\\*\\$'
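If you are using a Python version < 3.7, this will also escape non-alphanumerics that are not part of regular expression syntax.
If you are using a Python version < 3.7 but >= 3.3, this will escape non-alphanumerics that are not part of regular expression syntax, with the specific exception of the underscore (_).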
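Whichever characters a given version chooses to escape, the escaped pattern should still match the original string literally. A small sketch to check that property follows; the sample strings are arbitrary, and the exact escaped form printed will differ between Python versions.

```python
import re

# Arbitrary sample inputs containing metacharacters, spaces and backslashes.
samples = ["^a.*$", "Word(s)", "a_b c", "C:\\path\\file.txt", "50% [off]!"]

for s in samples:
    escaped = re.escape(s)
    # The escaped pattern must match the original string in full.
    assert re.fullmatch(escaped, s) is not None
    print(repr(s), "->", repr(escaped))

# An escaped dot no longer acts as a wildcard.
assert re.fullmatch(re.escape("a.c"), "abc") is None
```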
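Unfortunately, re.escape() is not suited for the replacement string (the output below is from a Python version < 3.3, where re.escape() still escaped the underscore):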
>>> re.sub('a', re.escape('_'), 'aa')
'\\_\\_'
A solution is to put the replacement in a lambda:
>>> re.sub('a', lambda _: '_', 'aa')
'__'
because the return value of the lambda is treated by re.sub() as a literal string.
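To make the difference concrete, here is a sketch with a replacement that contains backslashes. The value of `user_replacement` is a made-up example, and doubling the backslashes is shown as an alternative to the lambda, not as something taken from the answer above.

```python
import re

user_replacement = r"\1 and \n"  # hypothetical user input, meant literally

# Option 1: a function's return value is inserted as-is.
via_function = re.sub("a", lambda m: user_replacement, "a-a")

# Option 2: double every backslash so the replacement template reads
# each pair as one literal backslash.
via_doubling = re.sub("a", user_replacement.replace("\\", "\\\\"), "a-a")

print(via_function)  # \1 and \n-\1 and \n
print(via_function == via_doubling)  # True

# Passing the raw input directly fails: \1 refers to a group
# that the pattern "a" does not have.
try:
    re.sub("a", user_replacement, "a-a")
except re.error as exc:
    print("re.error:", exc)
```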
Usually you escape a string that you feed into a regex so that the regex treats its characters literally. Remember that you type strings into your computer and something has to decide which characters they stand for. When you see \n in your editor, it is not a newline until a parser decides it is; in the file it is just two characters, a backslash followed by n. In a normal string literal such as "\n", Python's parser turns those two characters into a single newline, which is what print then displays. If you write r"\n" instead, Python keeps exactly what you typed, a backslash and an n (as far as I understand).

To complicate things further, regexes have their own syntax on top of that. The regex parser interprets the string it receives by its own rules, not Python's. I believe this is why we are advised to pass raw strings like r"(\n+)", so that the regex receives what you actually typed. Even so, the regex will receive a parenthesis and won't match it as a literal parenthesis unless you say so explicitly using the regex's own escaping rules. For that you need something like r"(fun \( x : nat \) :)": the outer parentheses are not matched literally, because without backslashes they form a capture group, while the inner, backslash-escaped ones match literal parentheses.
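The two layers can be seen separately in a short sketch (the sample string "f(x)" is chosen only for illustration):

```python
import re

# Layer 1: Python string literals.
print(len("\n"))         # 1 (a single newline character)
print(len(r"\n"))        # 2 (a backslash followed by n)
print("\\n" == r"\n")    # True

# Layer 2: regex syntax applied to the string the engine receives.
capture = re.search(r"(x)", "f(x)")    # parentheses form a group
literal = re.search(r"\(x\)", "f(x)")  # parentheses are matched literally

print(capture.group(0))  # x
print(literal.group(0))  # (x)
```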
So usually we call re.escape(regex) to escape things we want interpreted literally: characters the regex parser would otherwise treat as syntax, such as parentheses and spaces, get escaped. For example, this is code I have in my app:
# Escape regex metacharacters so that _ppt matches as an arbitrary literal string. I think this also helps keep the escaped literal text distinct from the regex syntax we insert on the next line.
__ppt = re.escape(_ppt)  # e.g. a parenthesis ( is then matched literally instead of opening a group
For example, look at these strings:
_ppt
Out[4]: '(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
__ppt
Out[5]: '\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
print(rf'{_ppt=}')
_ppt='(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
print(rf'{__ppt=}')
__ppt='\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
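At first I was surprised that this printed double backslashes instead of single ones. The doubling comes from repr, which both Out[...] and the = specifier in an f-string use: each pair stands for one literal backslash in the actual string, and that single backslash is what the regex engine receives. If anyone can comment further on this, it would be appreciated.
I was also curious how to match a literal backslash in a regex. I had assumed it takes 4 backslashes, but thanks to the raw string r construct only 2 are needed: r'\\' as a raw string, or '\\\\' as a normal string literal.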
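One way to check what an escaped string really contains is to compare print with repr and to look at its length. The sketch below uses short made-up inputs instead of the _ppt string above.

```python
import re

escaped = re.escape("(x)")
print(escaped)        # \(x\)
print(repr(escaped))  # '\\(x\\)'
print(len(escaped))   # 5

# Matching one literal backslash: two backslashes in a raw string,
# four in a normal string literal. Both spell the same pattern.
path = "a\\b"  # three characters: a, backslash, b
print(r"\\" == "\\\\")               # True
print(re.search(r"\\", path).span())  # (1, 2)
```

repr shows each backslash doubled so that its output is valid Python source, while print shows the characters the string actually holds.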