这是最简单的解释。这是我正在使用的:
re.split('\W', 'foo/bar spam\neggs')
>>> ['foo', 'bar', 'spam', 'eggs']
这是我想要的:
someMethod('\W', 'foo/bar spam\neggs')
>>> ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']
原因是我想把一个字符串分割成令牌,操作它,然后再把它组合在一起。
这是最简单的解释。这是我正在使用的:
re.split('\W', 'foo/bar spam\neggs')
>>> ['foo', 'bar', 'spam', 'eggs']
这是我想要的:
someMethod('\W', 'foo/bar spam\neggs')
>>> ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']
原因是我想把一个字符串分割成令牌,操作它,然后再把它组合在一起。
当前回答
如果你在换行上分割,使用splitlines(True)。
>>> 'line 1\nline 2\nline without newline'.splitlines(True)
['line 1\n', 'line 2\n', 'line without newline']
(不是一个通用的解决方案,但在这里添加这个,以防有人来到这里没有意识到这个方法的存在。)
其他回答
我在尝试分割文件路径时遇到了类似的问题,并努力找到一个简单的答案。 这对我来说很有效,并且不需要将分隔符替换回分割文本:
My_path = 'folder1/folder2/folder3/file1'
进口再保险
re.findall ('[^/]+/|[^/]+', my_path)
返回:
['folder1/', 'folder2/', 'folder3/', 'file1']
使用re.split,并且你的正则表达式来自变量,并且你有多个分隔符,你可以像下面这样使用:
# BashSpecialParamList is the special param in bash,
# such as your separator is the bash special param
BashSpecialParamList = ["$*", "$@", "$#", "$?", "$-", "$$", "$!", "$0"]
# aStr is the the string to be splited
aStr = "$a Klkjfd$0 $? $#%$*Sdfdf"
reStr = "|".join([re.escape(sepStr) for sepStr in BashSpecialParamList])
re.split(f'({reStr})', aStr)
# Then You can get the result:
# ['$a Klkjfd', '$0', ' ', '$?', ' ', '$#', '%', '$*', 'Sdfdf']
参考:GNU Bash特殊参数
如果你想拆分字符串,同时通过regex保留分隔符,而不捕获组:
def finditer_with_separators(regex, s):
matches = []
prev_end = 0
for match in regex.finditer(s):
match_start = match.start()
if (prev_end != 0 or match_start > 0) and match_start != prev_end:
matches.append(s[prev_end:match.start()])
matches.append(match.group())
prev_end = match.end()
if prev_end < len(s):
matches.append(s[prev_end:])
return matches
regex = re.compile(r"[\(\)]")
matches = finditer_with_separators(regex, s)
如果假设regex被封装到捕获组中:
def split_with_separators(regex, s):
matches = list(filter(None, regex.split(s)))
return matches
regex = re.compile(r"([\(\)])")
matches = split_with_separators(regex, s)
这两种方法也将删除空组,在大多数情况下是无用和恼人的。
之前发布的一些答案,会重复分隔符,或者有一些我在自己的情况下遇到的其他错误。你可以使用这个函数:
def split_and_keep_delimiter(input, delimiter):
result = list()
idx = 0
while delimiter in input:
idx = input.index(delimiter);
result.append(input[0:idx+len(delimiter)])
input = input[idx+len(delimiter):]
result.append(input)
return result
你也可以用字符串数组而不是正则表达式分割字符串,就像这样:
def tokenizeString(aString, separators):
#separators is an array of strings that are being used to split the string.
#sort separators in order of descending length
separators.sort(key=len)
listToReturn = []
i = 0
while i < len(aString):
theSeparator = ""
for current in separators:
if current == aString[i:i+len(current)]:
theSeparator = current
if theSeparator != "":
listToReturn += [theSeparator]
i = i + len(theSeparator)
else:
if listToReturn == []:
listToReturn = [""]
if(listToReturn[-1] in separators):
listToReturn += [""]
listToReturn[-1] += aString[i]
i += 1
return listToReturn
print(tokenizeString(aString = "\"\"\"hi\"\"\" hello + world += (1*2+3/5) '''hi'''", separators = ["'''", '+=', '+', "/", "*", "\\'", '\\"', "-=", "-", " ", '"""', "(", ")"]))