这是最简单的解释。这是我正在使用的:

re.split('\W', 'foo/bar spam\neggs')
>>> ['foo', 'bar', 'spam', 'eggs']

这是我想要的:

someMethod('\W', 'foo/bar spam\neggs')
>>> ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

原因是我想把一个字符串分割成令牌,操作它,然后再把它组合在一起。


当前回答

另一个在Python 3上工作良好的非正则表达式解决方案

# Split strings and keep separator
test_strings = ['<Hello>', 'Hi', '<Hi> <Planet>', '<', '']

def split_and_keep(s, sep):
   if not s: return [''] # consistent with string.split()

   # Find replacement character that is not used in string
   # i.e. just use the highest available character plus one
   # Note: This fails if ord(max(s)) = 0x10FFFF (ValueError)
   p=chr(ord(max(s))+1) 

   return s.replace(sep, sep+p).split(p)

for s in test_strings:
   print(split_and_keep(s, '<'))


# If the unicode limit is reached it will fail explicitly
unicode_max_char = chr(1114111)
ridiculous_string = '<Hello>'+unicode_max_char+'<World>'
print(split_and_keep(ridiculous_string, '<'))

其他回答

如果你在换行上分割,使用splitlines(True)。

>>> 'line 1\nline 2\nline without newline'.splitlines(True)
['line 1\n', 'line 2\n', 'line without newline']

(不是一个通用的解决方案,但在这里添加这个,以防有人来到这里没有意识到这个方法的存在。)

另一个在Python 3上工作良好的非正则表达式解决方案

# Split strings and keep separator
test_strings = ['<Hello>', 'Hi', '<Hi> <Planet>', '<', '']

def split_and_keep(s, sep):
   if not s: return [''] # consistent with string.split()

   # Find replacement character that is not used in string
   # i.e. just use the highest available character plus one
   # Note: This fails if ord(max(s)) = 0x10FFFF (ValueError)
   p=chr(ord(max(s))+1) 

   return s.replace(sep, sep+p).split(p)

for s in test_strings:
   print(split_and_keep(s, '<'))


# If the unicode limit is reached it will fail explicitly
unicode_max_char = chr(1114111)
ridiculous_string = '<Hello>'+unicode_max_char+'<World>'
print(split_and_keep(ridiculous_string, '<'))

另一个例子,在非字母数字上进行分割,并保留分隔符

import re
a = "foo,bar@candy*ice%cream"
re.split('([^a-zA-Z0-9])',a)

输出:

['foo', ',', 'bar', '@', 'candy', '*', 'ice', '%', 'cream']

解释

re.split('([^a-zA-Z0-9])',a)

() <- keep the separators
[] <- match everything in between
^a-zA-Z0-9 <-except alphabets, upper/lower and numbers.
>>> line = 'hello_toto_is_there'
>>> sep = '_'
>>> [sep + x[1] if x[0] != 0 else x[1] for x in enumerate(line.split(sep))]
['hello', '_toto', '_is', '_there']

你也可以用字符串数组而不是正则表达式分割字符串,就像这样:

def tokenizeString(aString, separators):
    #separators is an array of strings that are being used to split the string.
    #sort separators in order of descending length
    separators.sort(key=len)
    listToReturn = []
    i = 0
    while i < len(aString):
        theSeparator = ""
        for current in separators:
            if current == aString[i:i+len(current)]:
                theSeparator = current
        if theSeparator != "":
            listToReturn += [theSeparator]
            i = i + len(theSeparator)
        else:
            if listToReturn == []:
                listToReturn = [""]
            if(listToReturn[-1] in separators):
                listToReturn += [""]
            listToReturn[-1] += aString[i]
            i += 1
    return listToReturn
    

print(tokenizeString(aString = "\"\"\"hi\"\"\" hello + world += (1*2+3/5) '''hi'''", separators = ["'''", '+=', '+', "/", "*", "\\'", '\\"', "-=", "-", " ", '"""', "(", ")"]))