I sometimes want to match whitespace but not newlines.
So far I've been resorting to [ \t]. Is there a less awkward way?
Current answer
What you are looking for is the POSIX blank character class. In Perl, it is written as:
[[:blank:]]
In Java (don't forget to enable UNICODE_CHARACTER_CLASS):
\p{Blank}
Compared to the similar \h, POSIX blank is supported by a few more regex engines (reference). A major benefit is that its definition is fixed in Annex C: Compatibility Properties of Unicode Regular Expressions and standard across all regex flavors that support Unicode. (In Perl, for example, \h chooses to additionally include the MONGOLIAN VOWEL SEPARATOR.) However, an argument in favor of \h is that it always detects Unicode characters (even if the engines don't agree on which), while POSIX character classes are often by default ASCII-only (as in Java).
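The difference between the ASCII-only default and the Unicode-aware mode can be checked with a small sketch like the one below. The class name BlankDemo and its helper are made up for illustration; the only library pieces used are java.util.regex.Pattern and its UNICODE_CHARACTER_CLASS flag.

```java
import java.util.regex.Pattern;

final class BlankDemo {
    // Default mode: \p{Blank} only covers ASCII space and tab.
    static final Pattern ASCII_BLANK = Pattern.compile("\\p{Blank}");

    // With the flag, \p{Blank} follows the Unicode definition instead.
    static final Pattern UNICODE_BLANK =
            Pattern.compile("\\p{Blank}", Pattern.UNICODE_CHARACTER_CLASS);

    // Java's \h shorthand for horizontal whitespace needs no flag.
    static final Pattern H = Pattern.compile("\\h");

    // Hypothetical helper: true if the whole string matches the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // NO-BREAK SPACE
        System.out.println(matches(ASCII_BLANK, "\t"));   // true
        System.out.println(matches(ASCII_BLANK, nbsp));   // false
        System.out.println(matches(UNICODE_BLANK, nbsp)); // true
        System.out.println(matches(H, nbsp));             // true
        System.out.println(matches(UNICODE_BLANK, "\n")); // false
    }
}
```

In both modes the newline is left alone, which is the property the question asks for.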
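But the problem is that even sticking with Unicode does not solve the issue 100%. Consider the following characters, which are not considered whitespace in Unicode:
U+180E MONGOLIAN VOWEL SEPARATOR
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+2060 WORD JOINER
U+FEFF ZERO WIDTH NON-BREAKING SPACE
Taken from https://en.wikipedia.org/wiki/White-space_character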
The aforementioned MONGOLIAN VOWEL SEPARATOR is probably excluded for a good reason. It, along with U+200C and U+200D, occurs within words (AFAIK), and therefore breaks the cardinal rule that all other whitespace obeys: you can tokenize with it. These characters are more like modifiers. However, ZERO WIDTH SPACE, WORD JOINER, and ZERO WIDTH NON-BREAKING SPACE (if it is used as something other than a byte-order mark) fit the whitespace rule in my book. Therefore, I include them in my horizontal whitespace character class.
In Java:
public static final String HORIZONTAL_WHITESPACE = "[\\p{Blank}\\u200B\\u2060\\uFEFF]";
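As a rough illustration of the "you can tokenize with it" rule, the sketch below splits a line on runs of this character class. The class name HorizontalSplit and the tokens helper are hypothetical. The constant uses U+FEFF, the code point of ZERO WIDTH NON-BREAKING SPACE, and the pattern is compiled with UNICODE_CHARACTER_CLASS so that \p{Blank} is not limited to ASCII.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class HorizontalSplit {
    // Unicode blanks plus ZERO WIDTH SPACE, WORD JOINER and U+FEFF.
    static final String HORIZONTAL_WHITESPACE = "[\\p{Blank}\\u200B\\u2060\\uFEFF]";

    // One or more horizontal whitespace characters act as a separator.
    static final Pattern SEPARATOR =
            Pattern.compile(HORIZONTAL_WHITESPACE + "+", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: split a single line into tokens.
    static List<String> tokens(String line) {
        return Arrays.asList(SEPARATOR.split(line));
    }

    public static void main(String[] args) {
        // ZERO WIDTH SPACE and a tab-plus-space run both separate tokens.
        System.out.println(tokens("foo\u200Bbar\t baz")); // [foo, bar, baz]
        // A newline is not a separator, so the input stays in one piece.
        System.out.println(tokens("one\ntwo").size());    // 1
    }
}
```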
Other answers
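m/ /g: just put a literal space between the slashes and it will work. Alternatively, use \s, but note that it matches every whitespace character, including tabs, newlines, and spaces.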
The regex below matches whitespace characters but not the newline character.
(?:(?!\n)\s)
If you also want to exclude the carriage return, add \r to the negative lookahead, here by using a character class:
(?:(?![\n\r])\s)
Add + after the non-capturing group to match one or more whitespace characters.
(?:(?![\n\r])\s)+
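A minimal sketch of the lookahead pattern in use, assuming the goal is to collapse runs of horizontal whitespace into a single space while keeping line breaks intact. The class name CollapseSpaces and the collapse helper are invented for this example. Note that Java's default \s also covers vertical tab and form feed, which this pattern does not exclude.

```java
final class CollapseSpaces {
    // One or more whitespace characters, none of which is \n or \r.
    static final String HSPACE_RUN = "(?:(?![\\n\\r])\\s)+";

    // Hypothetical helper: squeeze each run into a single space.
    static String collapse(String text) {
        return text.replaceAll(HSPACE_RUN, " ");
    }

    public static void main(String[] args) {
        String out = collapse("a \t b\r\nc");
        // The space-tab-space run shrinks; the \r\n line break survives.
        System.out.println(out.equals("a b\r\nc")); // true
    }
}
```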
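I don't know why nobody has mentioned the POSIX character class [[:blank:]], which matches any horizontal whitespace (spaces and tabs). This POSIX character class works in BRE (Basic Regular Expressions), ERE (Extended Regular Expressions), and PCRE (Perl Compatible Regular Expressions).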
Put the regex below in the Find field and select Regular expression under "Search Mode":
[^\S\r\n]+
A variation on Greg's answer that also accounts for carriage returns:
/[^\S\r\n]/
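This regex is safer than /[^\S\n]/ without the \r. My reasoning is that Windows uses \r\n for newlines, and Mac OS 9 used \r. You are unlikely to find a \r without a \n nowadays, but if you do, it cannot mean anything but a newline. Thus, since \r can mean a newline, we should exclude it too.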
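The effect on Windows line endings can be seen in a small sketch such as the following. The class name DoubleNegative and the stripWith helper are made up for illustration. The example deletes every match, so whatever the pattern treats as "whitespace but not a line break" disappears from the input.

```java
final class DoubleNegative {
    // Hypothetical helper: delete every match of the regex from the text.
    static String stripWith(String regex, String text) {
        return text.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String windows = "a \r\nb";

        // Without \r in the class, the carriage return counts as plain
        // whitespace and is removed, silently turning \r\n into \n.
        System.out.println(stripWith("[^\\S\\n]+", windows).equals("a\nb"));       // true

        // With \r excluded as well, the Windows line ending is preserved.
        System.out.println(stripWith("[^\\S\\r\\n]+", windows).equals("a\r\nb"));  // true
    }
}
```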