What is a word boundary in a regular expression?

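I am trying to use regular expressions to match space-separated numbers. I can't find a precise definition of \b ("word boundary"). I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b), but that does not seem to work. I would be grateful for a way to do this.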

[I am using Java regexes in Java 1.6]

Example:

Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());

String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());

pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());

This returns:

true
false
true

Current answer

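I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to give a language a name that is hard to write regular expressions for.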

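Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): in most regex flavors, the characters matched by the shorthand character class \w are the characters that word boundaries treat as word characters. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time.)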

The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.

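This is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and the pluses) are broken by \b: the characters ., + and # are not word characters, so \b cannot match between one of them and an adjacent space or punctuation mark.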

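Note: I'm not sure what to do about mistakes in the text, such as someone not putting a space after the period at the end of a sentence. I allowed for it, but I'm not sure that is necessarily the right thing to do.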

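Anyway, in Java, if you are searching text for those weirdly named languages, you need to replace \b with explicit before-and-after designators for whitespace and punctuation. For example: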

public static String grep(String regexp, String multiLineStringToSearch) {
    String result = "";
    String[] lines = multiLineStringToSearch.split("\\n");
    Pattern pattern = Pattern.compile(regexp);
    for (String line : lines) {
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            result = result + "\n" + line;
        }
    }
    return result.trim();
}

Then in your test or main function:

    String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";   
    String afterWord =  "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
    String text = "Programming in C, (C++) C#, Java, and .NET.";
    System.out.println("text="+text);
    // Here is where Java word boundaries do not work correctly on "cutesy" computer language names.  
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
    System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
    System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
    System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));

    System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
    System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
    System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text));  // Works Ok for this example, but see below
    // Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
    text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
    System.out.println("text="+text);
    System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
    // Make sure the first and last cases work OK.

    text = "C is a language that should have been named differently.";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    text = "One language that should have been named differently is C";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    //Make sure we don't get false positives
    text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
    System.out.println("text="+text);
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
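
As a possible alternative to the hand-written beforeWord/afterWord lists, the same idea can be sketched with lookarounds. This is a different technique from the one in this answer, not the author's method: `(?<!\w)` and `(?!\w)` only require that no word character touches the token, and `Pattern.quote` handles the escaping. The class and method names (`TokenSearch`, `containsToken`) are hypothetical, chosen only for illustration.

```java
import java.util.regex.Pattern;

public class TokenSearch {
    // Hypothetical helper: true if `token` occurs in `text` with no word
    // character (by default [A-Za-z0-9_]) directly before or after it.
    static boolean containsToken(String token, String text) {
        // Pattern.quote escapes '.', '+' and any other metacharacters in the token.
        Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(token) + "(?!\\w)");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Programming in C, (C++) C#, Java, and .NET.";
        System.out.println(containsToken(".NET", text)); // true
        System.out.println(containsToken("C++", text));  // true
        System.out.println(containsToken("C#", text));   // true
        System.out.println(containsToken("Jav", text));  // false: followed by 'a'
    }
}
```

Note that this sketch is looser than the explicit lists above: because & and + are not word characters, searching for "C" this way would still match the C in "C&O" and in "C++".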

P.S. My thanks to http://regexpal.com/, without which the regex world would be a very miserable place!

2013-12-16 16:54:42

Other answers

I think it is the boundary (i.e. the character following) of the last match, or the beginning or end of the string.

2009-08-24 20:55:23

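I would like to explain Alan Moore's answer.

A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.

Suppose I have the string "This is a cat, and she's awesome", and I want to replace every occurrence of the letter 'a', but only where it sits at a word boundary (here, the start of a word).

In other words, the letter 'a' inside "cat" should not be replaced.

So I run the regex (in Python) as

re.sub(r"\ba", "e", myString.strip())  # replace a with e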

So the result is:

Input; Output

This is a cat, and she's awesome

This is e cat, end she's ewesome
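
Since the question is about Java, here is a rough Java equivalent of the Python call above, as a minimal sketch. The class and method names (`BoundaryReplace`, `replaceLeadingA`) are made up for illustration; `String.replaceAll` is the standard library call.

```java
public class BoundaryReplace {
    // Replaces every 'a' that immediately follows a word boundary,
    // i.e. every 'a' at the start of a word, with 'e'.
    static String replaceLeadingA(String s) {
        return s.replaceAll("\\ba", "e");
    }

    public static void main(String[] args) {
        // prints: This is e cat, end she's ewesome
        System.out.println(replaceLeadingA("This is a cat, and she's awesome"));
    }
}
```

The 'a' in "cat" is left alone because it is preceded by 'c', a word character, so there is no boundary in front of it.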

2019-02-11 11:39:19

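I believe your problem is due to the fact that - is not a word character. The word boundary therefore matches after the -, and so does not capture it. Word boundaries match before the first and after the last word character in a string, as well as at any position that has a word character on one side and a non-word character on the other. Also note that a word boundary is a zero-width match.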

One possible alternative is

(?:(?:^|\s)-?)\d+\b

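This matches any number that starts with a whitespace character and an optional dash and ends at a word boundary. It also matches a number that starts at the beginning of the string.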
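
To try the alternative pattern in Java, something like the following sketch could be used. The class and method names (`SignedNumberFinder`, `firstNumber`) are hypothetical. Because the leading whitespace is consumed as part of the match, the sketch trims it before returning.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SignedNumberFinder {
    // The alternative pattern from this answer, written as a Java string literal.
    static final Pattern NUMBER = Pattern.compile("(?:(?:^|\\s)-?)\\d+\\b");

    // Returns the first match with surrounding whitespace trimmed,
    // or null if the input contains no match.
    static String firstNumber(String input) {
        Matcher m = NUMBER.matcher(input);
        return m.find() ? m.group().trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstNumber(" 12 "));   // 12
        System.out.println(firstNumber(" -12 "));  // -12
        System.out.println(firstNumber("x-12"));   // null: no whitespace or start before the dash
    }
}
```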

2009-08-24 20:59:46

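A word boundary can occur in one of three positions:

1. Before the first character in the string, if the first character is a word character.
2. After the last character in the string, if the last character is a word character.
3. Between two characters in the string, where one is a word character and the other is not.

Word characters are alphanumeric (letters, digits, and the underscore); a minus sign is not. Taken from Regex Tutorial.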
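
The three cases can be checked directly by asking Java where \b matches. The sketch below (the names `BoundaryPositions` and `boundaries` are made up for illustration) collects the index of every zero-width \b match in a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryPositions {
    // Returns the index of every position in s where \b matches.
    static List<Integer> boundaries(String s) {
        List<Integer> positions = new ArrayList<Integer>();
        Matcher m = Pattern.compile("\\b").matcher(s);
        while (m.find()) {
            positions.add(m.start()); // zero-width match: start() == end()
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(boundaries(" 12 "));   // [1, 3]
        System.out.println(boundaries(" -12 "));  // [2, 4]
    }
}
```

In " -12 " there is no boundary at index 1, between the space and the minus sign, because neither is a word character. That is the position where the question's pattern \s*\b\-?\d+\s* needs \b to match, which is why it returns false for " -12 ".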

2009-08-24 21:05:57
