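I am trying to use regexes to match space-separated numbers. I can't find a precise definition of \b ("word boundary"). I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b), but it appears that this does not work. I'd be grateful to know of a way to do this.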

[I am using Java regexes in Java 1.6]

Example:

Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());

String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());

pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());

This returns:

true
false
true

Current answer

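I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.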

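Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): in most regex flavors, the characters matched by the shorthand character class \w are the characters that word boundaries treat as word characters. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time.)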

The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.

This is why Java-based regex searches for C++, C#, or .NET (even when you remember to escape the period and the pluses) get screwed up by \b.
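To make the failure concrete, here is a minimal sketch (the class and method names are made up for illustration). It suggests that the searches fail because '#', '+', and '.' are non-word characters: \b only matches between a word character and a non-word character (or a string edge), so there is no boundary between '#' and ',' or between a space and '.'.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class BoundaryFailureDemo {
    // Returns true if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Programming in C, (C++) C#, Java, and .NET.";
        // '#' and ',' are both non-word characters: no boundary after "C#".
        System.out.println(found("\\bC#\\b", text));       // false
        // '+' and ')' are both non-word characters: no boundary after "C++".
        System.out.println(found("\\bC\\+\\+\\b", text));  // false
        // ' ' and '.' are both non-word characters: no boundary before ".NET".
        System.out.println(found("\\b\\.NET\\b", text));   // false
        // "Java" has non-word characters on both sides, so both boundaries exist.
        System.out.println(found("\\bJava\\b", text));     // true
    }
}
```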

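Note: I'm not sure what to do about mistakes in the text, like when someone doesn't put a space after the period at the end of a sentence. I allowed for it, but I'm not sure that it's necessarily the right thing to do.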

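Anyway, in Java, if you're searching text for those weirdly named languages, you need to replace \b with explicit before-and-after groups of whitespace and punctuation. For example: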

public static String grep(String regexp, String multiLineStringToSearch) {
    String result = "";
    String[] lines = multiLineStringToSearch.split("\\n");
    Pattern pattern = Pattern.compile(regexp);
    for (String line : lines) {
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            result = result + "\n" + line;
        }
    }
    return result.trim();
}

Then in your test or main function:

    String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";   
    String afterWord =  "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
    String text = "Programming in C, (C++) C#, Java, and .NET.";
    System.out.println("text="+text);
    // Here is where Java word boundaries do not work correctly on "cutesy" computer language names.  
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
    System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
    System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
    System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));

    System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
    System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
    System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text));  // Works Ok for this example, but see below
    // Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
    text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
    System.out.println("text="+text);
    System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
    // Make sure the first and last cases work OK.

    text = "C is a language that should have been named differently.";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    text = "One language that should have been named differently is C";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    //Make sure we don't get false positives
    text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
    System.out.println("text="+text);
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

PS: My thanks to http://regexpal.com/, without which the regex world would be a very miserable place!
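One possible variation, which is not part of the original answer: the beforeWord and afterWord groups above consume the neighbouring character, so the match includes the delimiter. The sketch below uses negative lookarounds over the same delimiter set instead, which keeps the delimiters out of the match. The class and method names are hypothetical, and Pattern.quote is used so that '+' and '.' need no manual escaping.

```java
import java.util.regex.Pattern;

// Hypothetical helper; a sketch under the assumption that the delimiter
// set from the answer (whitespace . , ! ? ( ) ' ") is the one you want.
class DelimitedSearch {
    // "Not preceded by a non-delimiter" means: start of input or a delimiter.
    private static final String BEFORE = "(?<![^\\s.,!?()'\"])";
    // "Not followed by a non-delimiter" means: end of input or a delimiter.
    private static final String AFTER = "(?![^\\s.,!?()'\"])";

    static Pattern wholeTerm(String literal) {
        return Pattern.compile(BEFORE + Pattern.quote(literal) + AFTER);
    }

    static boolean contains(String literal, String text) {
        return wholeTerm(literal).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Programming in C, (C++) C#, Java, and .NET.";
        System.out.println(contains("C++", text));   // true
        System.out.println(contains("C#", text));    // true
        System.out.println(contains(".NET", text));  // true
        // '&' is not in the delimiter set, so the C in "C&O" is rejected.
        System.out.println(contains("C", "Worked on C&O (Chesapeake and Ohio) Canal"));  // false
    }
}
```

Because the lookarounds are zero-width, Matcher.group() returns just the term, which may be convenient if you later switch from find() to replacement.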

Other answers

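I would like to explain Alan Moore's answer.

A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.

Say I have the string "This is a cat, and she's awesome", and I want to replace every occurrence of the letter 'a', but only where that letter sits at the "boundary of a word".

In other words, the letter 'a' inside "cat" should not be replaced.

So I would run the regex (in Python) as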

re.sub(r"\ba", "e", myString.strip())  # replace a with e

So,

Input; Output

This is a cat, and she's awesome

This is e cat, end she's ewesome
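Since the question is about Java, here is a rough Java equivalent of the Python call above, assuming the same input string. The class and method names are invented for this sketch; String.replaceAll takes a regex, so \b works the same way here.

```java
// Hypothetical demo class mirroring the Python re.sub example.
class ReplaceAtBoundary {
    // Replaces every 'a' that starts a word with 'e'.
    static String replaceLeadingA(String s) {
        return s.trim().replaceAll("\\ba", "e");
    }

    public static void main(String[] args) {
        // The 'a' in "cat" follows the word character 'c', so \b does not match there.
        System.out.println(replaceLeadingA("This is a cat, and she's awesome"));
        // prints: This is e cat, end she's ewesome
    }
}
```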

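In most regex dialects, a word boundary is a position between \w and \W (a non-word character), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).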

So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.
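A small sketch can confirm this and suggests one way to repair the pattern from the question: move the optional minus sign in front of the \b. The class and method names are hypothetical, and the corrected pattern is only one of several reasonable choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not from the original thread.
class MinusTwelve {
    // Collects every index at which \b matches in s.
    static List<Integer> boundaries(String s) {
        List<Integer> positions = new ArrayList<Integer>();
        Matcher m = Pattern.compile("\\b").matcher(s);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    // Optional minus first, then a boundary, then the digits.
    static boolean isInteger(String s) {
        return Pattern.matches("\\s*-?\\b\\d+\\b\\s*", s);
    }

    public static void main(String[] args) {
        System.out.println(boundaries("-12"));   // [1, 3]
        System.out.println(isInteger(" 12 "));   // true
        System.out.println(isInteger(" -12 "));  // true
    }
}
```

The original pattern fails on " -12 " because \b is required between the space and the dash, and neither of those is a word character.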

I believe it is the boundary (i.e. the character following) of the last match, or the beginning or end of the string.

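A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.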
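That definition can be written out directly with lookarounds. The following sketch compares \b with the spelled-out form on a plain ASCII string; the names are made up for illustration, and the comparison is only claimed for ASCII input, since \b and \w may treat non-ASCII characters differently depending on the Java version and flags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; a sketch, not a formal proof of equivalence.
class BoundaryEquivalence {
    // Preceded by a word char and not followed by one, or the reverse.
    static final String EXPLICIT = "(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)";

    // Collects every index at which the (zero-width) regex matches in s.
    static List<Integer> positions(String regex, String s) {
        List<Integer> result = new ArrayList<Integer>();
        Matcher m = Pattern.compile(regex).matcher(s);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "This is a cat, and she's awesome";
        // Both lists should be identical for ASCII text.
        System.out.println(positions("\\b", s));
        System.out.println(positions(EXPLICIT, s));
    }
}
```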