Split a string into words using multiple word-boundary delimiters

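I think what I want to do is a fairly common task, but I have not found any reference for it on the web. I have text with punctuation, and I want a list of the words.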

"Hey, you - what are you doing here!?"

should become

['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

But Python's str.split() only works with a single argument, so after splitting on whitespace every word still has its punctuation attached. Any ideas?

Current answer

If you need a reversible operation (one that keeps the delimiters), you can use this function:

def tokenizeSentence_Reversible(sentence):
    setOfDelimiters = ['.', ' ', ',', '*', ';', '!']
    listOfTokens = [sentence]

    for delimiter in setOfDelimiters:
        newListOfTokens = []
        for token in listOfTokens:
            # Split on the delimiter and put it back in front of every piece except the first.
            ll = [([delimiter, w] if ind > 0 else [w]) for ind, w in enumerate(token.split(delimiter))]
            flattened = [item for sublist in ll for item in sublist]  # flattens.
            newListOfTokens.extend(item for item in flattened if item)  # Removes empty tokens: ''

        listOfTokens = newListOfTokens

    return listOfTokens
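
A quick way to check that a tokenization like this is really reversible is to join the tokens back together and compare with the input. The sketch below is not the function above; it uses re.split with a capturing group as an alternative way to keep the delimiters, and the name tokenize_keep_delimiters and its default delimiter set are illustrative assumptions.

```python
import re

def tokenize_keep_delimiters(sentence, delimiters=".,*;! "):
    # Hypothetical helper: a capturing group makes re.split return the
    # delimiters as separate items alongside the text between them.
    pattern = "([" + re.escape(delimiters) + "])"
    # Drop the empty strings produced by adjacent delimiters.
    return [token for token in re.split(pattern, sentence) if token]

sentence = "Hey, you; what are you doing here!"
tokens = tokenize_keep_delimiters(sentence)
print(tokens)
# Joining the tokens restores the original sentence.
print("".join(tokens) == sentence)
```

Because every delimiter is kept as its own token, nothing is lost and "".join(tokens) reproduces the input exactly.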

2018-01-22 08:25:18

Other answers

I'm re-acquainting myself with Python and needed the same thing. The findall solution may be better, but I came up with this:

tokens = [x.strip() for x in data.split(',')]

2012-04-20 16:53:46

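First, always call re.compile() before performing any regex operation in a loop, because reusing a compiled pattern is faster than passing the pattern string to the re functions on every iteration.

So, for your problem, compile the pattern first and then perform the operation on it.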

import re
DATA = "Hey, you - what are you doing here!?"
reg_tok = re.compile(r"[\w']+")  # raw string keeps \w as a regex escape
print(reg_tok.findall(DATA))
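
The question asks for lowercase words, which findall alone does not produce. A minimal sketch, assuming it is acceptable to lowercase the whole input before matching; WORD_RE and words are illustrative names, not part of the answer above.

```python
import re

WORD_RE = re.compile(r"[\w']+")  # compiled once, reused for every call

def words(text):
    # Lowercase first so the tokens match the form asked for in the question.
    return WORD_RE.findall(text.lower())

print(words("Hey, you - what are you doing here!?"))
# ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```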

2015-06-02 07:06:45

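So many answers, yet I can't find any solution that efficiently does what the title of the question literally asks for (splitting on multiple possible separators; many answers instead split on anything that is not a word, which is different). So here is an answer to the question in the title, which relies on Python's standard and efficient re module: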

>>> import re  # Will be splitting on: , <space> - ! ? :
>>> list(filter(None, re.split(r"[, \-!?:]+", "Hey, you - what are you doing here!?")))
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

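where:

[…] matches one of the delimiters listed inside it; the \- in the regular expression prevents - from being interpreted as a character range indicator (as in A-Z); the + skips one or more delimiters (it could be omitted thanks to filter(), but that would needlessly produce empty strings between adjacent single-character delimiters); and filter(None, …) removes the empty strings that leading and trailing delimiters can create (since an empty string is falsy).

This re.split() precisely "splits with multiple separators", as asked for in the question title.

In addition, this solution is immune to the problems with non-ASCII characters in words that affect some of the other solutions (see the first comment on ghostdog74's answer).

The re module is much more efficient (in both speed and concision) than doing Python loops and tests "by hand"!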
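
When the delimiters are only known at run time, the character class can be built instead of typed by hand. The following is a sketch under that assumption; split_on is a hypothetical helper name, and re.escape is used so that characters such as - or ? are matched literally.

```python
import re

def split_on(text, delimiters):
    # Hypothetical helper: escape each delimiter so that it is matched
    # literally, then join the delimiters as alternatives.
    pattern = "(?:" + "|".join(re.escape(d) for d in delimiters) + ")+"
    # Drop the empty strings left by leading or trailing delimiters.
    return [piece for piece in re.split(pattern, text) if piece]

print(split_on("Hey, you - what are you doing here!?", [",", " ", "-", "!", "?", ":"]))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Using alternation rather than a character class is a design choice here: it leaves room for delimiters that are longer than one character.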

2014-05-18 09:43:54

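First, I don't think your intention is to actually use punctuation as delimiters in the split function. Your description suggests that you simply want to remove punctuation from the resulting strings.

I run into this quite often, and my usual solution doesn't require re.

One-line lambda function with a list comprehension:

(requires import string):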

split_without_punc = lambda text : [word.strip(string.punctuation) for word in 
    text.split() if word.strip(string.punctuation) != '']

# Call function
split_without_punc("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

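As a function (traditional)

As a traditional function, this is still only two lines with a list comprehension (in addition to import string):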

def split_without_punctuation2(text):

    # Split by whitespace
    words = text.split()

    # Strip punctuation from each word (requires import string)
    return [word.strip(string.punctuation) for word in words if word.strip(string.punctuation) != '']

split_without_punctuation2("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

It also naturally keeps contractions and hyphenated words intact. You can always use text.replace("-", " ") to turn hyphens into spaces before splitting.
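
To illustrate the point about contractions and hyphenated words, here is a small sketch; the sample sentence is made up for the example.

```python
import string

text = "Don't split well-known words, okay?"

# Stripping only touches the ends of each word, so inner apostrophes
# and hyphens survive.
kept = [w.strip(string.punctuation) for w in text.split()]
print(kept)
# ["Don't", 'split', 'well-known', 'words', 'okay']

# Replacing hyphens with spaces first splits hyphenated words too.
split_hyphens = [w.strip(string.punctuation) for w in text.replace("-", " ").split()]
print(split_hyphens)
# ["Don't", 'split', 'well', 'known', 'words', 'okay']
```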

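General function without lambda or list comprehension

For a more general solution (where you can specify the characters to strip) that uses no list comprehension, you get: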

def split_without(text: str, ignore: str) -> list:

    # Split by whitespace
    split_string = text.split()

    # Strip any characters in the ignore string, and ignore empty strings
    words = []
    for word in split_string:
        word = word.strip(ignore)
        if word != '':
            words.append(word)

    return words

# Situation-specific call to general function
import string
final_text = split_without("Hey, you - what are you doing?!", string.punctuation)
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

Of course, you can also generalize the lambda function to any specified string of characters.
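
One possible way to generalize the lambda is sketched below. The name split_without_chars and the choice of string.punctuation as the default are assumptions made for illustration.

```python
import string

# Same idea as the one-line lambda, with the characters to strip passed in.
split_without_chars = lambda text, ignore=string.punctuation: [
    word.strip(ignore) for word in text.split() if word.strip(ignore) != ''
]

print(split_without_chars("Hey, you -- what are you doing?!"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing']
print(split_without_chars("*bold* _text_", "*"))
# ['bold', '_text_']
```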

2014-11-04 19:17:37

Another quick way to do this without a regular expression is to replace the characters first, like this:

>>> 'a;bcd,ef g'.replace(';',' ').replace(',',' ').split()
['a', 'bcd', 'ef', 'g']
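
With more than a couple of delimiters, the chain of replace() calls gets long. A sketch of the same replace-then-split idea using str.translate follows; split_by_replacing is a hypothetical helper name, and it assumes every delimiter is a single character.

```python
def split_by_replacing(text, delimiters):
    # Map every delimiter character to a space, then let split()
    # collapse the resulting runs of whitespace.
    table = str.maketrans({d: " " for d in delimiters})
    return text.translate(table).split()

print(split_by_replacing("a;bcd,ef g", ";,"))
# ['a', 'bcd', 'ef', 'g']
print(split_by_replacing("Hey, you - what are you doing here!?", ",-!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```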

2011-08-27 16:10:52
