I sometimes want to match whitespace but not newlines.

So far I've been resorting to [ \t]. Is there a less awkward way?


Current answer

What you are looking for is the POSIX blank character class. In Perl it is written as:

[[:blank:]]

In Java (don't forget to enable UNICODE_CHARACTER_CLASS):

\p{Blank}

Compared to the similar \h, POSIX blank is supported by a few more regex engines (reference). A major benefit is that its definition is fixed in Annex C: Compatibility Properties of Unicode Regular Expressions and standard across all regex flavors that support Unicode. (In Perl, for example, \h chooses to additionally include the MONGOLIAN VOWEL SEPARATOR.) However, an argument in favor of \h is that it always detects Unicode characters (even if the engines don't agree on which), while POSIX character classes are often by default ASCII-only (as in Java).

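The catch is that even sticking with Unicode doesn't solve the problem 100%. Consider the following characters, which Unicode does not classify as whitespace:

U+180E MONGOLIAN VOWEL SEPARATOR
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+2060 WORD JOINER
U+FEFF ZERO WIDTH NON-BREAKING SPACE
Taken from https://en.wikipedia.org/wiki/White-space_character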

The aforementioned Mongolian vowel separator isn't included for what is probably a good reason. It, along with U+200C and U+200D, occurs within words (AFAIK), and therefore breaks the cardinal rule that all other whitespace obeys: you can tokenize with it. These characters are more like modifiers. However, ZERO WIDTH SPACE, WORD JOINER, and ZERO WIDTH NON-BREAKING SPACE (if it is used as something other than a byte-order mark) fit the whitespace rule in my book. Therefore, I include them in my horizontal whitespace character class.

In Java:

public static final String HORIZONTAL_WHITESPACE = "[\\p{Blank}\\u200B\\u2060\\uFEFF]";
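
To see what the flag changes in practice, here is a minimal sketch. The class and method names are made up for illustration, and the class uses U+FEFF, the code point named in the list above. It compares the default ASCII-only \p{Blank} with the extended class compiled under UNICODE_CHARACTER_CLASS.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern strings come from the answer above.
class HorizontalWhitespaceDemo {
    // Blank plus ZERO WIDTH SPACE, WORD JOINER, and ZERO WIDTH NO-BREAK SPACE (U+FEFF).
    static final String HORIZONTAL_WHITESPACE = "[\\p{Blank}\\u200B\\u2060\\uFEFF]";

    // Without the flag, \p{Blank} only covers the ASCII space and tab.
    static final Pattern ASCII_BLANK = Pattern.compile("\\p{Blank}");

    // With the flag, \p{Blank} follows the Unicode definition of blank.
    static final Pattern UNICODE_HWS =
        Pattern.compile(HORIZONTAL_WHITESPACE, Pattern.UNICODE_CHARACTER_CLASS);

    static boolean isAsciiBlank(String s) {
        return ASCII_BLANK.matcher(s).matches();
    }

    static boolean isHorizontalWs(String s) {
        return UNICODE_HWS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // space, tab, NO-BREAK SPACE, ZERO WIDTH SPACE, line feed
        String[] samples = {" ", "\t", "\u00A0", "\u200B", "\n"};
        for (String s : samples) {
            System.out.printf("U+%04X ascii=%b unicode=%b%n",
                (int) s.charAt(0), isAsciiBlank(s), isHorizontalWs(s));
        }
    }
}
```

The no-break space is the interesting case: it should only be recognized once the flag is on, while the line feed is rejected either way.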

Other answers

The regex below matches whitespace but not newline characters.

(?:(?!\n)\s)

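If you also want to exclude carriage returns, add \r to the negative lookahead, either with the | operator as in (?!\n|\r) or, equivalently, with a character class: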

(?:(?![\n\r])\s)

Add + after the non-capturing group to match one or more whitespace characters.

(?:(?![\n\r])\s)+
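
As a rough illustration of where this pattern is useful, the sketch below collapses each run of horizontal whitespace into a single space while leaving line breaks alone. The class name and the collapse helper are made up for the example, and Java is used only because it is easy to run.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the regex itself is the one shown above.
class CollapseHorizontalRuns {
    // One or more whitespace characters, none of which is LF or CR.
    static final Pattern HWS_RUN = Pattern.compile("(?:(?![\\n\\r])\\s)+");

    // Replace each run with a single space; CR and LF are never part of a run.
    static String collapse(String text) {
        return HWS_RUN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String input = "a \t  b\r\nc\t\td";
        // Compare against the expected result with the CR LF pair still intact.
        System.out.println(collapse(input).equals("a b\r\nc d"));
    }
}
```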

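I don't know why none of you mentioned the POSIX character class [[:blank:]], which matches any horizontal whitespace (spaces and tabs). This POSIX character class works in BRE (POSIX Basic Regular Expressions), ERE (POSIX Extended Regular Expressions), and PCRE (Perl Compatible Regular Expressions).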

Put the regex below in the Find field and select Regular expression under "Search Mode":

[^\S\r\n]+

A variation on Greg's answer that also handles carriage returns:

/[^\S\r\n]/

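This regex is safer than /[^\S\n]/, which lacks the \r. My reasoning is that Windows uses \r\n for newlines and Mac OS 9 used \r. You are unlikely to find a \r without a \n nowadays, but if you do, it cannot mean anything but a newline. So, since \r can mean a newline, we should exclude it too.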
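
To make the difference concrete, here is a small sketch (in Java, with made-up class and method names) that strips trailing blanks from a line ending in CR LF. The lookaheads are an addition for the example and are not part of the regex above. With \r excluded, the line ending survives; with only \n excluded, the \r is treated as ordinary whitespace and removed along with the blanks.

```java
import java.util.regex.Pattern;

// Hypothetical demo; the lookaheads restrict the match to trailing blanks.
class TrailingBlanks {
    // Whitespace that is neither CR nor LF, directly before a line break or the end.
    static final Pattern SAFE = Pattern.compile("[^\\S\\r\\n]+(?=[\\r\\n]|\\z)");

    // Only LF is excluded here, so CR still counts as strippable whitespace.
    static final Pattern LF_ONLY = Pattern.compile("[^\\S\\n]+(?=\\n|\\z)");

    static String strip(Pattern p, String text) {
        return p.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String windowsText = "x \t\r\ny";
        // The CR LF pair is preserved.
        System.out.println(strip(SAFE, windowsText).equals("x\r\ny"));
        // The CR is swallowed together with the blanks.
        System.out.println(strip(LF_ONLY, windowsText).equals("x\ny"));
    }
}
```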

Use a double negative:

/[^\S\r\n]/

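That is, the class matches any character that is not one of: non-whitespace (the capital S complements \s), carriage return, or newline. Distributing the outer not (i.e., the complementing ^ in the character class) with De Morgan's law, this is equivalent to "whitespace, but not carriage return and not newline." Including both \r and \n in the pattern correctly handles the Unix (LF), classic Mac OS (CR), and DOS-ish (CR LF) newline conventions.

No need to take my word for it: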

#! /usr/bin/env perl

use strict;
use warnings;

use 5.005;  # for qr//

my $ws_not_crlf = qr/[^\S\r\n]/;

for (' ', '\f', '\t', '\r', '\n') {
  my $qq = qq["$_"];
  printf "%-4s => %s\n", $qq,
    (eval $qq) =~ $ws_not_crlf ? "match" : "no match";
}

Output:

" "  => match
"\f" => match
"\t" => match
"\r" => no match
"\n" => no match

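Note that the vertical tab is excluded in older Perls; this was addressed in v5.18.

Before you object too strongly, note that the Perl documentation uses the same technique. A footnote in the "Whitespace" section of perlrecharclass reads:

Prior to Perl v5.18, \s did not match the vertical tab. [^\S\cK] (obscurely) matches what \s traditionally did.

The same section of perlrecharclass also suggests other approaches that won't offend language teachers who object to double negatives.

Outside of locale and Unicode rules, or when the /a switch is in effect, "\s matches [\t\n\f\r ] and, starting in Perl v5.18, the vertical tab, \cK." Discard \r and \n to leave /[\t\f\cK ]/ for matching whitespace but not newlines.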

If your text is Unicode, use code similar to the sub below to construct a pattern from the table in the aforementioned documentation section.

sub ws_not_nl {
  local($_) = <<'EOTable';
0x0009        CHARACTER TABULATION   h s
0x000a              LINE FEED (LF)    vs
0x000b             LINE TABULATION    vs  [1]
0x000c              FORM FEED (FF)    vs
0x000d        CARRIAGE RETURN (CR)    vs
0x0020                       SPACE   h s
0x0085             NEXT LINE (NEL)    vs  [2]
0x00a0              NO-BREAK SPACE   h s  [2]
0x1680            OGHAM SPACE MARK   h s
0x2000                     EN QUAD   h s
0x2001                     EM QUAD   h s
0x2002                    EN SPACE   h s
0x2003                    EM SPACE   h s
0x2004          THREE-PER-EM SPACE   h s
0x2005           FOUR-PER-EM SPACE   h s
0x2006            SIX-PER-EM SPACE   h s
0x2007                FIGURE SPACE   h s
0x2008           PUNCTUATION SPACE   h s
0x2009                  THIN SPACE   h s
0x200a                  HAIR SPACE   h s
0x2028              LINE SEPARATOR    vs
0x2029         PARAGRAPH SEPARATOR    vs
0x202f       NARROW NO-BREAK SPACE   h s
0x205f   MEDIUM MATHEMATICAL SPACE   h s
0x3000           IDEOGRAPHIC SPACE   h s
EOTable

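  my $class;
  # Capture the hex code point and the full name, including hyphens and the
  # parenthesized abbreviation, so that (LF), (CR), and (NEL) are visible below.
  while (/^0x([0-9a-f]{4})\s+([-A-Z ()]+)/mg) {
    my($hex,$name) = ($1,$2);
    # Skip the line-breaking characters; keep every other whitespace character.
    next if $name =~ /\b(?:LF|CR|NEL|SEPARATOR)\b/;
    $class .= "\\N{U+$hex}";
  }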

  qr/[$class]/u;
}

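Other applications

The double-negative trick is also handy for matching alphabetic characters. Remember that \w matches "word characters": alphabetic characters, digits, and underscore. We ugly Americans sometimes want to write it as, say,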

if (/[A-Za-z]+/) { ... }

but a double-negative character class can respect the locale:

if (/[^\W\d_]+/) { ... }

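Expressing "a word character but not a digit or underscore" this way is a bit opaque. A POSIX character class communicates the intent more directly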

if (/[[:alpha:]]+/) { ... }

or with a Unicode property, as szbalint suggested

if (/\p{Letter}+/) { ... }
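
For readers on Java rather than Perl, the same comparison can be sketched as follows. This is an illustration under an assumption, not a translation of the Perl semantics: Java has no locale-aware \w, so the double-negative class only picks up non-ASCII letters when UNICODE_CHARACTER_CLASS is set, and under that flag it also admits combining marks and some connector punctuation. The class and method names are made up.

```java
import java.util.regex.Pattern;

// Hypothetical demo comparing four ways to say "a run of letters" in Java.
class LetterClasses {
    static final Pattern ASCII_ONLY = Pattern.compile("[A-Za-z]+");
    // By default \W and \d are ASCII-based in Java, so this is ASCII letters only.
    static final Pattern DOUBLE_NEG = Pattern.compile("[^\\W\\d_]+");
    // With the flag, \W and \d follow their Unicode definitions.
    static final Pattern DOUBLE_NEG_UNICODE =
        Pattern.compile("[^\\W\\d_]+", Pattern.UNICODE_CHARACTER_CLASS);
    // The Unicode general category Letter, independent of the flag.
    static final Pattern LETTER = Pattern.compile("\\p{L}+");

    // True when the whole string consists of characters the pattern accepts.
    static boolean all(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "caf\u00E9"; // ends in LATIN SMALL LETTER E WITH ACUTE
        System.out.println("ascii:          " + all(ASCII_ONLY, word));
        System.out.println("double neg:     " + all(DOUBLE_NEG, word));
        System.out.println("double neg (U): " + all(DOUBLE_NEG_UNICODE, word));
        System.out.println("\\p{L}:          " + all(LETTER, word));
    }
}
```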