我有一个多行字符串,由一组不同的分隔符分隔:

(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)

我可以使用string将这个字符串分割成各个部分。分裂,但似乎我无法获得与分隔符正则表达式匹配的实际字符串。

换句话说,这就是我得到的结果:

Text1 Text2 Text3 Text4

这就是我想要的

Text1 DelimiterA Text2 DelimiterC Text3 DelimiterB Text4

JDK中是否有任何方法可以使用分隔符正则表达式分割字符串,但同时保留分隔符?


当前回答

这个问题的一个微妙之处涉及到“前导分隔符”问题:如果要有一个组合的令牌和分隔符数组,则必须知道它是以令牌还是以分隔符开始的。你当然可以假设前导界限应该被丢弃,但这似乎是一个不合理的假设。你可能还想知道你是否有一个拖拽的delim。这将相应地设置两个布尔标志。

用Groovy编写,但Java版本应该相当明显:

            String tokenRegex = /[\p{L}\p{N}]+/ // a String in Groovy, Unicode alphanumeric
            def finder = phraseForTokenising =~ tokenRegex
            // NB in Groovy the variable 'finder' is then of class java.util.regex.Matcher
            def finderIt = finder.iterator() // extra method added to Matcher by Groovy magic
            int start = 0
            boolean leadingDelim, trailingDelim
            def combinedTokensAndDelims = [] // create an array in Groovy

            while( finderIt.hasNext() )
            {
                def token = finderIt.next()
                int finderStart = finder.start()
                String delim = phraseForTokenising[ start  .. finderStart - 1 ]
                // Groovy: above gets slice of String/array
                if( start == 0 ) leadingDelim = finderStart != 0
                if( start > 0 || leadingDelim ) combinedTokensAndDelims << delim
                combinedTokensAndDelims << token // add element to end of array
                start = finder.end()
            }
            // start == 0 indicates no tokens found
            if( start > 0 ) {
                // finish by seeing whether there is a trailing delim
                trailingDelim = start < phraseForTokenising.length()
                if( trailingDelim ) combinedTokensAndDelims << phraseForTokenising[ start .. -1 ]

                println( "leading delim? $leadingDelim, trailing delim? $trailingDelim, combined array:\n $combinedTokensAndDelims" )

            }

其他回答

import java.util.regex.*;
import java.util.LinkedList;

public class Splitter {
    private static final Pattern DEFAULT_PATTERN = Pattern.compile("\\s+");

    private Pattern pattern;
    private boolean keep_delimiters;

    public Splitter(Pattern pattern, boolean keep_delimiters) {
        this.pattern = pattern;
        this.keep_delimiters = keep_delimiters;
    }
    public Splitter(String pattern, boolean keep_delimiters) {
        this(Pattern.compile(pattern==null?"":pattern), keep_delimiters);
    }
    public Splitter(Pattern pattern) { this(pattern, true); }
    public Splitter(String pattern) { this(pattern, true); }
    public Splitter(boolean keep_delimiters) { this(DEFAULT_PATTERN, keep_delimiters); }
    public Splitter() { this(DEFAULT_PATTERN); }

    public String[] split(String text) {
        if (text == null) {
            text = "";
        }

        int last_match = 0;
        LinkedList<String> splitted = new LinkedList<String>();

        Matcher m = this.pattern.matcher(text);

        while (m.find()) {

            splitted.add(text.substring(last_match,m.start()));

            if (this.keep_delimiters) {
                splitted.add(m.group());
            }

            last_match = m.end();
        }

        splitted.add(text.substring(last_match));

        return splitted.toArray(new String[splitted.size()]);
    }

    public static void main(String[] argv) {
        if (argv.length != 2) {
            System.err.println("Syntax: java Splitter <pattern> <text>");
            return;
        }

        Pattern pattern = null;
        try {
            pattern = Pattern.compile(argv[0]);
        }
        catch (PatternSyntaxException e) {
            System.err.println(e);
            return;
        }

        Splitter splitter = new Splitter(pattern);

        String text = argv[1];
        int counter = 1;
        for (String part : splitter.split(text)) {
            System.out.printf("Part %d: \"%s\"\n", counter++, part);
        }
    }
}

/*
    Example:
    > java Splitter "\W+" "Hello World!"
    Part 1: "Hello"
    Part 2: " "
    Part 3: "World"
    Part 4: "!"
    Part 5: ""
*/

我不太喜欢另一种方式,前后都有一个空元素。分隔符通常不在字符串的开头或结尾,因此通常会浪费两个良好的数组插槽。

编辑:固定的限制情况。带有测试用例的注释源代码可以在这里找到:http://snippets.dzone.com/posts/show/6453

调整pattern .split()以将匹配的模式包含到列表中

添加

// add match to the list
        matchList.add(input.subSequence(start, end).toString());

完整的源

public static String[] inclusiveSplit(String input, String re, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<String>();

    Pattern pattern = Pattern.compile(re);
    Matcher m = pattern.matcher(input);

    // Add segments before each match found
    while (m.find()) {
        int end = m.end();
        if (!matchLimited || matchList.size() < limit - 1) {
            int start = m.start();
            String match = input.subSequence(index, start).toString();
            matchList.add(match);
            // add match to the list
            matchList.add(input.subSequence(start, end).toString());
            index = end;
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index, input.length())
                    .toString();
            matchList.add(match);
            index = end;
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] { input.toString() };

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize - 1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

另一个使用正则表达式的候选解决方案。保留令牌顺序,正确匹配一行中相同类型的多个令牌。缺点是正则表达式有点讨厌。

package javaapplication2;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaApplication2 {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        String num = "58.5+variable-+98*78/96+a/78.7-3443*12-3";

        // Terrifying regex:
        //  (a)|(b)|(c) match a or b or c
        // where
        //   (a) is one or more digits optionally followed by a decimal point
        //       followed by one or more digits: (\d+(\.\d+)?)
        //   (b) is one of the set + * / - occurring once: ([+*/-])
        //   (c) is a sequence of one or more lowercase latin letter: ([a-z]+)
        Pattern tokenPattern = Pattern.compile("(\\d+(\\.\\d+)?)|([+*/-])|([a-z]+)");
        Matcher tokenMatcher = tokenPattern.matcher(num);

        List<String> tokens = new ArrayList<>();

        while (!tokenMatcher.hitEnd()) {
            if (tokenMatcher.find()) {
                tokens.add(tokenMatcher.group());
            } else {
                // report error
                break;
            }
        }

        System.out.println(tokens);
    }
}

样例输出:

[58.5, +, variable, -, +, 98, *, 78, /, 96, +, a, /, 78.7, -, 3443, *, 12, -, 3]

我不太懂Java,但如果你找不到一个Split方法,我建议你自己做一个。

string[] mySplit(string s,string delimiter)
{
    string[] result = s.Split(delimiter);
    for(int i=0;i<result.Length-1;i++)
    {
        result[i] += delimiter; //this one would add the delimiter to each items end except the last item, 
                    //you can modify it however you want
    }
}
string[] res = mySplit(myString,myDelimiter);

它不是很优雅,但也可以。

传递第三个参数为“true”。它还将返回分隔符。

StringTokenizer(String str, String delimiters, true);