这是最简单的解释。这是我正在使用的:

re.split('\W', 'foo/bar spam\neggs')
>>> ['foo', 'bar', 'spam', 'eggs']

这是我想要的:

someMethod('\W', 'foo/bar spam\neggs')
>>> ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

原因是我想把一个字符串分割成令牌,操作它,然后再把它组合在一起。


当前回答

在下面的代码中,对这个问题有一个简单、高效且经过测试的答案。代码中有解释其中所有内容的注释。

我保证它并不像看起来那么可怕——它实际上只有13行代码!其余的都是注释、文档和断言

def split_including_delimiters(input: str, delimiter: str):
    """
    Splits an input string, while including the delimiters in the output
    
    Unlike str.split, we can use an empty string as a delimiter
    Unlike str.split, the output will not have any extra empty strings
    Conequently, len(''.split(delimiter))== 0 for all delimiters,
       whereas len(input.split(delimiter))>0 for all inputs and delimiters
    
    INPUTS:
        input: Can be any string
        delimiter: Can be any string

    EXAMPLES:
         >>> split_and_keep_delimiter('Hello World  ! ',' ')
        ans = ['Hello ', 'World ', ' ', '! ', ' ']
         >>> split_and_keep_delimiter("Hello**World**!***", "**")
        ans = ['Hello', '**', 'World', '**', '!', '**', '*']
    EXAMPLES:
        assert split_and_keep_delimiter('-xx-xx-','xx') == ['-', 'xx', '-', 'xx', '-'] # length 5
        assert split_and_keep_delimiter('xx-xx-' ,'xx') == ['xx', '-', 'xx', '-']      # length 4
        assert split_and_keep_delimiter('-xx-xx' ,'xx') == ['-', 'xx', '-', 'xx']      # length 4
        assert split_and_keep_delimiter('xx-xx'  ,'xx') == ['xx', '-', 'xx']           # length 3
        assert split_and_keep_delimiter('xxxx'   ,'xx') == ['xx', 'xx']                # length 2
        assert split_and_keep_delimiter('xxx'    ,'xx') == ['xx', 'x']                 # length 2
        assert split_and_keep_delimiter('x'      ,'xx') == ['x']                       # length 1
        assert split_and_keep_delimiter(''       ,'xx') == []                          # length 0
        assert split_and_keep_delimiter('aaa'    ,'xx') == ['aaa']                     # length 1
        assert split_and_keep_delimiter('aa'     ,'xx') == ['aa']                      # length 1
        assert split_and_keep_delimiter('a'      ,'xx') == ['a']                       # length 1
        assert split_and_keep_delimiter(''       ,''  ) == []                          # length 0
        assert split_and_keep_delimiter('a'      ,''  ) == ['a']                       # length 1
        assert split_and_keep_delimiter('aa'     ,''  ) == ['a', '', 'a']              # length 3
        assert split_and_keep_delimiter('aaa'    ,''  ) == ['a', '', 'a', '', 'a']     # length 5
    """

    # Input assertions
    assert isinstance(input,str), "input must be a string"
    assert isinstance(delimiter,str), "delimiter must be a string"

    if delimiter:
        # These tokens do not include the delimiter, but are computed quickly
        tokens = input.split(delimiter)
    else:
        # Edge case: if the delimiter is the empty string, split between the characters
        tokens = list(input)
        
    # The following assertions are always true for any string input and delimiter
    # For speed's sake, we disable this assertion
    # assert delimiter.join(tokens) == input

    output = tokens[:1]

    for token in tokens[1:]:
        output.append(delimiter)
        if token:
            output.append(token)
    
    # Don't let the first element be an empty string
    if output[:1]==['']:
        del output[0]
        
    # The only case where we should have an empty string in the output is if it is our delimiter
    # For speed's sake, we disable this assertion
    # assert delimiter=='' or '' not in output
        
    # The resulting strings should be combinable back into the original string
    # For speed's sake, we disable this assertion
    # assert ''.join(output) == input

    return output

其他回答

如果你只有一个分隔符,你可以使用列表推导式:

text = 'foo,bar,baz,qux'  
sep = ','

附加/将分隔符:

result = [x+sep for x in text.split(sep)]
#['foo,', 'bar,', 'baz,', 'qux,']
# to get rid of trailing
result[-1] = result[-1].strip(sep)
#['foo,', 'bar,', 'baz,', 'qux']

result = [sep+x for x in text.split(sep)]
#[',foo', ',bar', ',baz', ',qux']
# to get rid of trailing
result[0] = result[0].strip(sep)
#['foo', ',bar', ',baz', ',qux']

分隔符作为它自己的元素:

result = [u for x in text.split(sep) for u in (x, sep)]
#['foo', ',', 'bar', ',', 'baz', ',', 'qux', ',']
results = result[:-1]   # to get rid of trailing

re.split的文档中提到:

根据出现的模式拆分字符串。如果捕获 括号是在模式中使用的,然后是文本中的所有组 模式也作为结果列表的一部分返回。

所以你只需要用一个捕获组来包装分隔符:

>>> re.split('(\W)', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

如果你在换行上分割,使用splitlines(True)。

>>> 'line 1\nline 2\nline without newline'.splitlines(True)
['line 1\n', 'line 2\n', 'line without newline']

(不是一个通用的解决方案,但在这里添加这个,以防有人来到这里没有意识到这个方法的存在。)

这里有一个简单的.split解决方案,不需要regex。

这是一个没有删除分隔符的Python split()的答案,所以不完全是最初的帖子所要求的,但另一个问题被关闭为这个问题的副本。

def splitkeep(s, delimiter):
    split = s.split(delimiter)
    return [substr + delimiter for substr in split[:-1]] + [split[-1]]

随机测试:

import random

CHARS = [".", "a", "b", "c"]
assert splitkeep("", "X") == [""]  # 0 length test
for delimiter in ('.', '..'):
    for _ in range(100000):
        length = random.randint(1, 50)
        s = "".join(random.choice(CHARS) for _ in range(length))
        assert "".join(splitkeep(s, delimiter)) == s

你也可以用字符串数组而不是正则表达式分割字符串,就像这样:

def tokenizeString(aString, separators):
    #separators is an array of strings that are being used to split the string.
    #sort separators in order of descending length
    separators.sort(key=len)
    listToReturn = []
    i = 0
    while i < len(aString):
        theSeparator = ""
        for current in separators:
            if current == aString[i:i+len(current)]:
                theSeparator = current
        if theSeparator != "":
            listToReturn += [theSeparator]
            i = i + len(theSeparator)
        else:
            if listToReturn == []:
                listToReturn = [""]
            if(listToReturn[-1] in separators):
                listToReturn += [""]
            listToReturn[-1] += aString[i]
            i += 1
    return listToReturn
    

print(tokenizeString(aString = "\"\"\"hi\"\"\" hello + world += (1*2+3/5) '''hi'''", separators = ["'''", '+=', '+', "/", "*", "\\'", '\\"', "-=", "-", " ", '"""', "(", ")"]))