I sometimes want to match whitespace, but not newlines.
So far I've been resorting to [ \t]. Is there a less awkward way?
Current answer
Put the regex below in the Find field and select Regular Expression under "Search Mode":
[^\S\r\n]+
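As a rough illustration of how this pattern behaves outside an editor, here is a minimal Java sketch. The class and method names are made up for the example. Note that with Java's default (ASCII-only) \s, the class matches space, tab, vertical tab and form feed.

```java
import java.util.regex.Pattern;

final class CollapseHorizontalWhitespace {
    // Double negation: "not (non-whitespace, CR or LF)",
    // i.e. any whitespace character except the two line-break characters.
    private static final Pattern HORIZONTAL_RUN = Pattern.compile("[^\\S\\r\\n]+");

    // Replaces each run of horizontal whitespace with a single space.
    static String collapse(String text) {
        return HORIZONTAL_RUN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Line breaks in the input survive; only spaces and tabs are collapsed.
        System.out.println(collapse("a \t  b\nc\t\td"));
    }
}
```

The line structure of the input is left alone, which is the point of excluding \r and \n from the class.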
Other answers
A variation on Greg's answer that also accounts for carriage returns:
/[^\S\r\n]/
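This regex is safer than /[^\S\n]/, which lacks the \r. My reasoning is that Windows uses \r\n for newlines, and Mac OS 9 used \r. You're unlikely to find \r without \n nowadays, but if you do find it, it can only mean a newline. So, since \r can mean a newline, we should exclude it too.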
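To make the difference concrete, the following Java sketch strips trailing blanks from a Windows-style line with both variants. The class and helper names are hypothetical. It is only meant to show that the variant without \r treats the carriage return as ordinary whitespace.

```java
import java.util.regex.Pattern;

final class CarriageReturnDemo {
    // Excludes only LF, so CR is still matched as whitespace.
    static final Pattern WITHOUT_CR = Pattern.compile("[^\\S\\n]+");
    // Excludes both CR and LF, so CRLF line endings stay intact.
    static final Pattern WITH_CR = Pattern.compile("[^\\S\\r\\n]+");

    // Deletes every match of the given pattern from the text.
    static String strip(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String windowsLine = "a \r\nb";
        // The first variant silently turns CRLF into a bare LF.
        System.out.println(strip(WITHOUT_CR, windowsLine).equals("a\nb"));
        // The second variant removes the space but keeps CRLF.
        System.out.println(strip(WITH_CR, windowsLine).equals("a\r\nb"));
    }
}
```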
What you are looking for is the POSIX blank character class. In Perl it is referenced as:
[[:blank:]]
In Java (don't forget to enable UNICODE_CHARACTER_CLASS):
\p{Blank}
Compared to the similar \h, POSIX blank is supported by a few more regex engines (reference). A major benefit is that its definition is fixed in Annex C: Compatibility Properties of Unicode Regular Expressions and standard across all regex flavors that support Unicode. (In Perl, for example, \h chooses to additionally include the MONGOLIAN VOWEL SEPARATOR.) However, an argument in favor of \h is that it always detects Unicode characters (even if the engines don't agree on which), while POSIX character classes are often by default ASCII-only (as in Java).
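The ASCII-only default in Java can be seen with a small sketch like the one below. The class and method names are invented for illustration. It compares \p{Blank} compiled with and without the UNICODE_CHARACTER_CLASS flag.

```java
import java.util.regex.Pattern;

final class PosixBlankDemo {
    // Default mode: POSIX blank is just space and tab.
    static final Pattern ASCII_BLANK = Pattern.compile("\\p{Blank}");
    // With the flag, the class follows the Unicode definition of blank.
    static final Pattern UNICODE_BLANK =
            Pattern.compile("\\p{Blank}", Pattern.UNICODE_CHARACTER_CLASS);

    // True when the whole string is a single blank under the given pattern.
    static boolean isBlank(Pattern pattern, String s) {
        return pattern.matcher(s).matches();
    }

    public static void main(String[] args) {
        String noBreakSpace = "\u00A0";
        System.out.println("ASCII mode, NO-BREAK SPACE:   " + isBlank(ASCII_BLANK, noBreakSpace));
        System.out.println("Unicode mode, NO-BREAK SPACE: " + isBlank(UNICODE_BLANK, noBreakSpace));
        System.out.println("Unicode mode, line feed:      " + isBlank(UNICODE_BLANK, "\n"));
    }
}
```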
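The problem, however, is that even sticking to Unicode doesn't solve the issue 100%. Consider the following characters, which aren't considered whitespace in Unicode:
U+180E MONGOLIAN VOWEL SEPARATOR
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+2060 WORD JOINER
U+FEFF ZERO WIDTH NON-BREAKING SPACE
Taken from https://en.wikipedia.org/wiki/White-space_character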
The aforementioned Mongolian vowel separator isn't included for what is probably a good reason. It, along with U+200C and U+200D, occurs within words (AFAIK), and therefore breaks the cardinal rule that all other whitespace obeys: you can tokenize with it. They're more like modifiers. However, ZERO WIDTH SPACE, WORD JOINER, and ZERO WIDTH NON-BREAKING SPACE (if it is used as something other than a byte-order mark) fit the whitespace rule in my book. Therefore, I include them in my horizontal whitespace character class.
In Java:
public static final String HORIZONTAL_WHITESPACE = "[\\p{Blank}\\u200B\\u2060\\uFEFF]";
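A minimal sketch of tokenizing with such a class might look like this. It assumes the last code point is meant to be U+FEFF (ZERO WIDTH NON-BREAKING SPACE) and that the pattern is compiled with UNICODE_CHARACTER_CLASS. The class name and the tokenize helper are hypothetical.

```java
import java.util.regex.Pattern;

final class HorizontalWhitespaceTokenizer {
    // POSIX blank plus ZERO WIDTH SPACE, WORD JOINER and
    // ZERO WIDTH NON-BREAKING SPACE (assumed here to be U+FEFF).
    static final String HORIZONTAL_WHITESPACE = "[\\p{Blank}\\u200B\\u2060\\uFEFF]";

    // The flag makes the Blank class cover non-ASCII blanks as well.
    private static final Pattern SEPARATOR =
            Pattern.compile(HORIZONTAL_WHITESPACE + "+", Pattern.UNICODE_CHARACTER_CLASS);

    // Splits on runs of horizontal whitespace; line breaks are not separators.
    static String[] tokenize(String line) {
        return SEPARATOR.split(line);
    }

    public static void main(String[] args) {
        String line = "alpha\u200Bbeta \tgamma\u2060delta\u3000epsilon";
        for (String token : tokenize(line)) {
            System.out.println(token);
        }
    }
}
```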
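Perl versions 5.10 and later support subsidiary vertical and horizontal character classes, \v and \h, as well as the generic whitespace character class \s.
The cleanest solution is to use the horizontal whitespace character class \h. This will match tab and space from the ASCII set, non-breaking space from extended ASCII, or any of these Unicode characters: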
U+0009 CHARACTER TABULATION
U+0020 SPACE
U+00A0 NO-BREAK SPACE (not matched by \s)
U+1680 OGHAM SPACE MARK
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE
The vertical space pattern \v is less useful, but matches these characters:
U+000A LINE FEED
U+000B LINE TABULATION
U+000C FORM FEED
U+000D CARRIAGE RETURN
U+0085 NEXT LINE (not matched by \s)
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
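There are seven vertical whitespace characters that match \v and eighteen horizontal ones that match \h. \s matches twenty-three characters.
All whitespace characters are either vertical or horizontal with no overlap, but they are not proper subsets of \s, because \h also matches U+00A0 NO-BREAK SPACE and \v also matches U+0085 NEXT LINE, neither of which is matched by \s.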
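The same split can be checked in other engines that offer \h and \v. The sketch below assumes Java 8 or later, where both classes exist, and uses made-up class and method names. It tests only the code points listed above, so it says nothing about extra characters a given engine may add.

```java
import java.util.regex.Pattern;

final class HorizontalVerticalDemo {
    // Code points copied from the horizontal list above.
    static final int[] HORIZONTAL = {
        0x0009, 0x0020, 0x00A0, 0x1680, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004,
        0x2005, 0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x202F, 0x205F, 0x3000
    };
    // Code points copied from the vertical list above.
    static final int[] VERTICAL = {
        0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029
    };

    // Counts how many of the given code points are matched by the regex.
    static int countMatches(String regex, int[] codePoints) {
        Pattern pattern = Pattern.compile(regex);
        int count = 0;
        for (int codePoint : codePoints) {
            String s = new String(Character.toChars(codePoint));
            if (pattern.matcher(s).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("horizontal list matched by \\h: " + countMatches("\\h", HORIZONTAL));
        System.out.println("vertical list matched by \\v:   " + countMatches("\\v", VERTICAL));
        System.out.println("horizontal list matched by \\v: " + countMatches("\\v", HORIZONTAL));
        System.out.println("vertical list matched by \\h:   " + countMatches("\\h", VERTICAL));
    }
}
```

Keep in mind that what \s itself covers differs between engines and modes, so the count of twenty-three applies to the Perl behaviour described here.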
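m/ /g: just put a space between the slashes and it will work. Or use \s, which matches all whitespace characters such as tabs, newlines, and spaces (so, unlike a literal space, it also matches newlines).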