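I am trying to use regexes to match space-separated numbers. I can't find a precise definition of \b ("word boundary"). I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b), but it appears that this does not work. I'd be grateful to know of a way to do it.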

[I am using Java regexes in Java 1.6]

Example:

Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());

String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());

pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());

This returns:

true
false
true

Current answer

I have written about \b-style regex boundaries elsewhere.

In short, they are conditional. Their behavior depends on what they are next to.

# same as using a \b before:
(?(?=\w) (?<!\w)  | (?<!\W) )

# same as using a \b after:
(?(?<=\w) (?!\w)  | (?!\W)  )

Sometimes that isn't what you want. See my other answer for details.

Other answers

Check out the documentation on boundary matchers:

http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html

Take a look at this example:

public static void main(final String[] args)
    {
        String x = "I found the value -12 in my string.";
        System.err.println(Arrays.toString(x.split("\\b-?\\d+\\b")));
    }

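When you print it out, notice that the output is:

[I found the value -,  in my string.]

This means the "-" character is not picked up as part of the match: there is no word boundary in front of it, because it is not considered a word character. It looks like @brianary beat me to the punch, so he gets an up-vote.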

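I would like to explain Alan Moore's answer.

A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.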
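The definition above can be spelled out with lookarounds. The following is a minimal sketch, not how the regex engine implements \b internally; the class name `BoundaryEquivalence` and the helper `positions` are made up for illustration. It assumes ASCII input, where Java's \b and \w agree on what counts as a word character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryEquivalence {
    // Lookaround spelling of the definition: (word before, none after)
    // OR (none before, word after). Illustration only, ASCII input assumed.
    static final String LOOKAROUND_BOUNDARY =
            "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";

    // Hypothetical helper: collects the start index of every (zero-width) match.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<Integer>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = " -12 ";
        // Both lines print [2, 4]: just before '1' and just after '2'.
        System.out.println(positions("\\b", input));
        System.out.println(positions(LOOKAROUND_BOUNDARY, input));
    }
}
```

Note that there is no boundary at index 1, in front of the minus sign, which is why the pattern from the question fails on " -12 ".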

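Suppose I have the string "This is a cat, and she's awesome", and I need to replace every occurrence of the letter 'a', but only where that letter sits at the "boundary of a word".

In other words, the letter a inside "cat" should not be replaced.

So I'll run the regex (in Python) as

re.sub(r"\ba", "e", myString.strip())  # replace a with e

So,

Input: This is a cat, and she's awesome

Output: This is e cat end she's ewesome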
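Since the question is about Java, here is a rough Java counterpart of the replacement above. The class and method names are hypothetical; `String.trim()` stands in for Python's `strip()`.

```java
class ReplaceAtWordStart {
    // Replaces every 'a' that directly follows a word boundary with 'e'.
    static String replaceLeadingA(String input) {
        return input.trim().replaceAll("\\ba", "e");
    }

    public static void main(String[] args) {
        // Prints: This is e cat end she's ewesome
        System.out.println(replaceLeadingA("This is a cat, and she's awesome"));
    }
}
```

The 'a' in "cat" is left alone because it is preceded by the word character 'c', so there is no boundary in front of it.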

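A word boundary can occur in one of three positions:

1. Before the first character in the string, if the first character is a word character.
2. After the last character in the string, if the last character is a word character.
3. Between two characters in the string, where one is a word character and the other is not a word character.

Word characters are alphanumeric; a minus sign is not. Taken from the Regex Tutorial.

In most regex dialects, a word boundary is a position between \w and \W (a non-word character), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).

So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.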
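One possible way to get what the question asks for (integers delimited by whitespace, optionally signed) is to replace \b with whitespace lookarounds. This is a sketch under that assumption, not the only fix; the class name `WhitespaceDelimitedIntegers` and the method `findIntegers` are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceDelimitedIntegers {
    // (?<!\S): not preceded by a non-whitespace character (start of input also passes).
    // (?!\S):  not followed by a non-whitespace character (end of input also passes).
    static final Pattern INTEGER = Pattern.compile("(?<!\\S)-?\\d+(?!\\S)");

    // Hypothetical helper: returns every whitespace-delimited integer in the input.
    static List<String> findIntegers(String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = INTEGER.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints: [12, -12, 7]
        System.out.println(findIntegers("12 -12 x-3 4y 7"));
    }
}
```

Unlike \b, these lookarounds do not care whether the neighboring character is a word character, only whether it is whitespace, so the minus sign can be part of the match.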

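I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.

Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): in most flavors of regex, the characters matched by the shorthand character class \w are the characters that word boundaries treat as word characters. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time.)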

The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.

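This is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and the pluses) get messed up by \b.

Note: I'm not sure what to do about mistakes in the text, such as someone not putting a space after the period at the end of a sentence. I allowed for it, but I'm not sure that it is necessarily the right thing to do.

Anyway, in Java, if you are searching text for those oddly named languages, you need to replace the \b with explicit groups for the whitespace and punctuation allowed before and after the word. For example: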

public static String grep(String regexp, String multiLineStringToSearch) {
    String result = "";
    String[] lines = multiLineStringToSearch.split("\\n");
    Pattern pattern = Pattern.compile(regexp);
    for (String line : lines) {
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            result = result + "\n" + line;
        }
    }
    return result.trim();
}

Then in your test or main function:

    String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";   
    String afterWord =  "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
    String text = "Programming in C, (C++) C#, Java, and .NET.";
    System.out.println("text="+text);
    // Here is where Java word boundaries do not work correctly on "cutesy" computer language names.  
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
    System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
    System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
    System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));

    System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
    System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
    System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text));  // Works Ok for this example, but see below
    // Because of the stupid too-short cutsey name, searches find stuff it shouldn't.
    text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
    System.out.println("text="+text);
    System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
    // Make sure the first and last cases work OK.

    text = "C is a language that should have been named differently.";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    text = "One language that should have been named differently is C";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    //Make sure we don't get false positives
    text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
    System.out.println("text="+text);
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
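As an alternative to listing the allowed punctuation by hand, the term can be wrapped in negative lookarounds that only forbid an adjacent word character. This is a sketch of a different technique from the beforeWord/afterWord groups above, with a hypothetical class `LiteralTermSearch` and method `containsTerm`; `Pattern.quote` is used so that '.', '+' and '#' need no manual escaping.

```java
import java.util.regex.Pattern;

class LiteralTermSearch {
    // True if the literal term occurs with no word character directly
    // before or after it. Punctuation and whitespace are both accepted.
    static boolean containsTerm(String text, String term) {
        Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(term) + "(?!\\w)");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Programming in C, (C++) C#, Java, and .NET.";
        System.out.println(containsTerm(text, ".NET")); // true
        System.out.println(containsTerm(text, "C++"));  // true
        System.out.println(containsTerm(text, "C#"));   // true
        // Limitation: '&' is not a word character, so "C&O" still counts as a hit for "C".
        System.out.println(containsTerm("Worked on C&O Canal", "C")); // true
    }
}
```

The trade-off is that this version is more permissive than the explicit lists: it accepts any non-word neighbor, so it does not filter out cases like "C&O".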

P.S. My thanks to http://regexpal.com/, without which the regex world would be a very miserable place!