我有一个多行字符串,由一组不同的分隔符分隔:
(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)
我可以使用string将这个字符串分割成各个部分。分裂,但似乎我无法获得与分隔符正则表达式匹配的实际字符串。
换句话说,这就是我得到的结果:
Text1
Text2
Text3
Text4
这就是我想要的
Text1
DelimiterA
Text2
DelimiterC
Text3
DelimiterB
Text4
JDK中是否有任何方法可以使用分隔符正则表达式分割字符串,但同时保留分隔符?
下面是一个基于上面一些代码的groovy版本,以防有用。不管怎样,它很短。有条件地包括头部和尾部(如果它们不是空的)。最后一部分是演示/测试用例。
List splitWithTokens(str, pat) {
def tokens=[]
def lastMatch=0
def m = str=~pat
while (m.find()) {
if (m.start() > 0) tokens << str[lastMatch..<m.start()]
tokens << m.group()
lastMatch=m.end()
}
if (lastMatch < str.length()) tokens << str[lastMatch..<str.length()]
tokens
}
[['<html><head><title>this is the title</title></head>',/<[^>]+>/],
['before<html><head><title>this is the title</title></head>after',/<[^>]+>/]
].each {
println splitWithTokens(*it)
}
我也会发布我的工作版本(第一个是真的类似Markus)。
public static String[] splitIncludeDelimeter(String regex, String text){
List<String> list = new LinkedList<>();
Matcher matcher = Pattern.compile(regex).matcher(text);
int now, old = 0;
while(matcher.find()){
now = matcher.end();
list.add(text.substring(old, now));
old = now;
}
if(list.size() == 0)
return new String[]{text};
//adding rest of a text as last element
String finalElement = text.substring(old);
list.add(finalElement);
return list.toArray(new String[list.size()]);
}
这是第二个解,比第一个快50%
public static String[] splitIncludeDelimeter2(String regex, String text){
List<String> list = new LinkedList<>();
Matcher matcher = Pattern.compile(regex).matcher(text);
StringBuffer stringBuffer = new StringBuffer();
while(matcher.find()){
matcher.appendReplacement(stringBuffer, matcher.group());
list.add(stringBuffer.toString());
stringBuffer.setLength(0); //clear buffer
}
matcher.appendTail(stringBuffer); ///dodajemy reszte ciagu
list.add(stringBuffer.toString());
return list.toArray(new String[list.size()]);
}
另一个使用正则表达式的候选解决方案。保留令牌顺序,正确匹配一行中相同类型的多个令牌。缺点是正则表达式有点讨厌。
package javaapplication2;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class JavaApplication2 {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
String num = "58.5+variable-+98*78/96+a/78.7-3443*12-3";
// Terrifying regex:
// (a)|(b)|(c) match a or b or c
// where
// (a) is one or more digits optionally followed by a decimal point
// followed by one or more digits: (\d+(\.\d+)?)
// (b) is one of the set + * / - occurring once: ([+*/-])
// (c) is a sequence of one or more lowercase latin letter: ([a-z]+)
Pattern tokenPattern = Pattern.compile("(\\d+(\\.\\d+)?)|([+*/-])|([a-z]+)");
Matcher tokenMatcher = tokenPattern.matcher(num);
List<String> tokens = new ArrayList<>();
while (!tokenMatcher.hitEnd()) {
if (tokenMatcher.find()) {
tokens.add(tokenMatcher.group());
} else {
// report error
break;
}
}
System.out.println(tokens);
}
}
样例输出:
[58.5, +, variable, -, +, 98, *, 78, /, 96, +, a, /, 78.7, -, 3443, *, 12, -, 3]