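I am trying to use regular expressions to match space-separated numbers. I can't find a precise definition of \b ("word boundary"). I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b), but that does not seem to work. I'd be grateful to know of a way to do this.

[I am using Java regexes in Java 1.6]

Example: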

Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());

String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());

pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());

This returns:

true
false
true

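Current answer

A word boundary can occur in one of three positions:

1. Before the first character in the string, if the first character is a word character.
2. After the last character in the string, if the last character is a word character.
3. Between two characters in the string, where one is a word character and the other is not.

Word characters are alphanumeric (Java's \w also counts the underscore); a minus sign is not. Taken from the Regex Tutorial.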
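
To see these rules at work on the question's input, here is a minimal sketch. The class name, the method name and the lookaround-based pattern are my own illustration rather than anything from the tutorial, and it assumes that "space-separated" means each number is delimited by whitespace or by the ends of the string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SignedIntegers {
    // Assumption: a number is delimited by whitespace or by the ends of the input.
    // (?<!\S) = not preceded by a non-whitespace character,
    // (?!\S)  = not followed by a non-whitespace character.
    private static final Pattern SIGNED_INT = Pattern.compile("(?<!\\S)-?\\d+(?!\\S)");

    static List<String> findIntegers(String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = SIGNED_INT.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // There is no boundary between ' ' and '-' because neither is a word
        // character, so \b-?\d+\b can only start matching at the digit.
        Matcher m = Pattern.compile("\\b-?\\d+\\b").matcher(" -12 ");
        if (m.find()) {
            System.out.println("with \\b: " + m.group());
        }
        System.out.println("with lookarounds: " + findIntegers(" -12 7 x-3 42"));
    }
}
```

With \b the minus sign is silently dropped from the match, while the whitespace lookarounds keep it and also reject x-3, which is not a standalone number.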

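Other answers

I think it's the boundary (i.e. the character following) of the last match, or the beginning or end of the string.

I have written elsewhere about what \b-style regex boundaries actually are.

The short story is that they are conditional. Their behavior depends on what they are next to.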

# same as using a \b before:
(?(?=\w) (?<!\w)  | (?<!\W) )

# same as using a \b after:
(?(?<=\w) (?!\w)  | (?!\W)  )

Sometimes that is not what you want. See my other answer for details.
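
Note that java.util.regex does not support the conditional (?(...)...|...) syntax used above, so those snippets cannot be passed to Pattern.compile as written. As a rough sketch (my own formulation, with made-up class and method names), a common unconditional way to spell \b with lookarounds is "a word character on exactly one side". The check below only compares the two on ASCII input, because Java's \b and \w have not always agreed on non-ASCII characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryEquivalence {
    // A word character behind and none ahead, or none behind and one ahead.
    static final String LOOKAROUND_BOUNDARY = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";

    // Collects the index of every (zero-width) match of the given regex.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<Integer>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = " -12 C++ foo_bar";
        System.out.println(positions("\\b", input));
        System.out.println(positions(LOOKAROUND_BOUNDARY, input));
    }
}
```

Both lines print the same list of positions for this input. Note that there is no boundary before the minus sign or after the second plus sign.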

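I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to give a language a name that is hard to write regular expressions for.

Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): in most regex flavors, the characters matched by the shorthand character class \w are the characters that word boundaries treat as word characters. Java is an exception, at least in older versions such as the 1.6 used in the question: it supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time.)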

The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.

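This is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and the pluses) get tripped up by \b. The characters '.', '+' and '#' are not word characters, so \b does not match between them and a neighbouring space or punctuation mark.

Note: I'm not sure what to do about mistakes in the text, such as someone leaving out the space after the period at the end of a sentence. I allowed for it, but I'm not sure that is necessarily the right thing to do.

Anyway, in Java, if you are searching text for those weirdly named languages, you need to replace \b with explicit before-and-after designators for whitespace and punctuation. For example: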

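import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns the lines of the input that contain a match for regexp, joined by newlines.
public static String grep(String regexp, String multiLineStringToSearch) {
    StringBuilder result = new StringBuilder();
    Pattern pattern = Pattern.compile(regexp);
    for (String line : multiLineStringToSearch.split("\\n")) {
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {  // find() looks for a match anywhere in the line
            result.append('\n').append(line);
        }
    }
    return result.toString().trim();
}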

Then in your test or main function:

    String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";   
    String afterWord =  "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
    String text = "Programming in C, (C++) C#, Java, and .NET.";
    System.out.println("text="+text);
    // Here is where Java word boundaries do not work correctly on "cutesy" computer language names.  
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
    System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
    System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
    System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));

    System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
    System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
    System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text));  // Works Ok for this example, but see below
    // Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
    text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
    System.out.println("text="+text);
    System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
    // Make sure the first and last cases work OK.

    text = "C is a language that should have been named differently.";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    text = "One language that should have been named differently is C";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    //Make sure we don't get false positives
    text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
    System.out.println("text="+text);
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
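
A possible alternative to the beforeWord/afterWord groups, sketched here under my own assumptions: lookarounds can check the surrounding delimiters without consuming them, and Pattern.quote saves escaping '.' and '+' by hand. The class and method names are illustrative, and the delimiter set is simply whitespace plus the same punctuation used above.

```java
import java.util.regex.Pattern;

public class TokenSearch {
    // Anything that is NOT one of the delimiters used by beforeWord/afterWord:
    // whitespace and . , ! ? ( ) ' "
    private static final String NON_DELIMITER = "[^\\s.,!?()'\"]";

    // Builds a pattern that finds the literal token only when it is neither
    // preceded nor followed by a non-delimiter character. The start and end
    // of the input count as delimiters because no character is there at all.
    static Pattern tokenPattern(String literal) {
        return Pattern.compile(
                "(?<!" + NON_DELIMITER + ")" + Pattern.quote(literal) + "(?!" + NON_DELIMITER + ")");
    }

    static boolean containsToken(String literal, String text) {
        return tokenPattern(literal).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Programming in C, (C++) C#, Java, and .NET.";
        System.out.println(containsToken(".NET", text));
        System.out.println(containsToken("C++", text));
        System.out.println(containsToken("C", "Worked on C&O Canal"));
    }
}
```

Because the lookarounds are zero-width, the match itself is just the token, which may be convenient if you later want to extract or replace it.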

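P.S.: My thanks to http://regexpal.com/, without whom the regex world would be very miserable!

While learning regular expressions, I was really stuck on the metacharacter \b. I kept asking myself "what is it, what is it" without grasping its meaning. After some experiments on a regex-testing website, I noticed the pink vertical dashes at the start and end of every word, and at that point it clicked: it is exactly a word (\w) boundary.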
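
The same picture can be reproduced in code. As a small sketch (the class and method names are mine), replacing every \b position with a visible marker shows where Java sees the boundaries:

```java
public class ShowBoundaries {
    // Inserts '|' at every position where \b matches.
    static String mark(String input) {
        return input.replaceAll("\\b", "|");
    }

    public static void main(String[] args) {
        System.out.println(mark("12 -12 C++"));  // prints "|12| -|12| |C|++"
    }
}
```

The marker lands between the minus sign and the digits, not in front of the minus sign, and nothing is marked after the trailing "++".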

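My view is purely about building understanding. For the logic behind it, see the other answers.

When you use \\b(\\w+)+\\b, it means an exact match against a word that contains only word characters ([a-zA-Z_0-9]).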

In your case, for example, putting \\b at the beginning of the regex will accept -12 (with a space) but will not accept -12 (without a space).

For reference, in support of the above: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html