这是最简单的解释。这是我正在使用的:
re.split('\W', 'foo/bar spam\neggs')
>>> ['foo', 'bar', 'spam', 'eggs']
这是我想要的:
someMethod('\W', 'foo/bar spam\neggs')
>>> ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']
原因是我想把一个字符串分割成令牌,操作它,然后再把它组合在一起。
这是最简单的解释。这是我正在使用的:
re.split('\W', 'foo/bar spam\neggs')
>>> ['foo', 'bar', 'spam', 'eggs']
这是我想要的:
someMethod('\W', 'foo/bar spam\neggs')
>>> ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']
原因是我想把一个字符串分割成令牌,操作它,然后再把它组合在一起。
当前回答
如果你只有一个分隔符,你可以使用列表推导式:
text = 'foo,bar,baz,qux'
sep = ','
附加/将分隔符:
result = [x+sep for x in text.split(sep)]
#['foo,', 'bar,', 'baz,', 'qux,']
# to get rid of trailing
result[-1] = result[-1].strip(sep)
#['foo,', 'bar,', 'baz,', 'qux']
result = [sep+x for x in text.split(sep)]
#[',foo', ',bar', ',baz', ',qux']
# to get rid of trailing
result[0] = result[0].strip(sep)
#['foo', ',bar', ',baz', ',qux']
分隔符作为它自己的元素:
result = [u for x in text.split(sep) for u in (x, sep)]
#['foo', ',', 'bar', ',', 'baz', ',', 'qux', ',']
results = result[:-1] # to get rid of trailing
其他回答
如果你想拆分字符串,同时通过regex保留分隔符,而不捕获组:
def finditer_with_separators(regex, s):
matches = []
prev_end = 0
for match in regex.finditer(s):
match_start = match.start()
if (prev_end != 0 or match_start > 0) and match_start != prev_end:
matches.append(s[prev_end:match.start()])
matches.append(match.group())
prev_end = match.end()
if prev_end < len(s):
matches.append(s[prev_end:])
return matches
regex = re.compile(r"[\(\)]")
matches = finditer_with_separators(regex, s)
如果假设regex被封装到捕获组中:
def split_with_separators(regex, s):
matches = list(filter(None, regex.split(s)))
return matches
regex = re.compile(r"([\(\)])")
matches = split_with_separators(regex, s)
这两种方法也将删除空组,在大多数情况下是无用和恼人的。
另一个例子,在非字母数字上进行分割,并保留分隔符
import re
a = "foo,bar@candy*ice%cream"
re.split('([^a-zA-Z0-9])',a)
输出:
['foo', ',', 'bar', '@', 'candy', '*', 'ice', '%', 'cream']
解释
re.split('([^a-zA-Z0-9])',a)
() <- keep the separators
[] <- match everything in between
^a-zA-Z0-9 <-except alphabets, upper/lower and numbers.
re.split的文档中提到:
根据出现的模式拆分字符串。如果捕获 括号是在模式中使用的,然后是文本中的所有组 模式也作为结果列表的一部分返回。
所以你只需要用一个捕获组来包装分隔符:
>>> re.split('(\W)', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']
如果你只有一个分隔符,你可以使用列表推导式:
text = 'foo,bar,baz,qux'
sep = ','
附加/将分隔符:
result = [x+sep for x in text.split(sep)]
#['foo,', 'bar,', 'baz,', 'qux,']
# to get rid of trailing
result[-1] = result[-1].strip(sep)
#['foo,', 'bar,', 'baz,', 'qux']
result = [sep+x for x in text.split(sep)]
#[',foo', ',bar', ',baz', ',qux']
# to get rid of trailing
result[0] = result[0].strip(sep)
#['foo', ',bar', ',baz', ',qux']
分隔符作为它自己的元素:
result = [u for x in text.split(sep) for u in (x, sep)]
#['foo', ',', 'bar', ',', 'baz', ',', 'qux', ',']
results = result[:-1] # to get rid of trailing
你也可以用字符串数组而不是正则表达式分割字符串,就像这样:
def tokenizeString(aString, separators):
#separators is an array of strings that are being used to split the string.
#sort separators in order of descending length
separators.sort(key=len)
listToReturn = []
i = 0
while i < len(aString):
theSeparator = ""
for current in separators:
if current == aString[i:i+len(current)]:
theSeparator = current
if theSeparator != "":
listToReturn += [theSeparator]
i = i + len(theSeparator)
else:
if listToReturn == []:
listToReturn = [""]
if(listToReturn[-1] in separators):
listToReturn += [""]
listToReturn[-1] += aString[i]
i += 1
return listToReturn
print(tokenizeString(aString = "\"\"\"hi\"\"\" hello + world += (1*2+3/5) '''hi'''", separators = ["'''", '+=', '+', "/", "*", "\\'", '\\"', "-=", "-", " ", '"""', "(", ")"]))