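Split a string into words using multiple word boundary delimiters

I think what I want to do is a fairly common task, but I've found no reference for it on the web. I have text with punctuation, and I want a list of the words.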

"Hey, you - what are you doing here!?"

should be

['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

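But Python's str.split() only works with one separator argument, so after splitting on whitespace every word still carries its punctuation. Any ideas?

Current answer

I like pprzemek's solution because it does not assume that the delimiters are single characters, and it does not try to leverage a regex (which would not work well if the list of separators got very long).

Here's a more readable version of the above solution for clarity: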

def split_string_on_multiple_separators(input_string, separators):
    # Start with the whole string, then split every piece on each separator in turn
    buffer = [input_string]
    for sep in separators:
        strings = buffer
        buffer = []  # reset the buffer
        for s in strings:
            buffer.extend(s.split(sep))

    return buffer
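
As a quick check of the function above, here is a minimal usage sketch. The separator lists and sample strings are illustrative choices of mine, not part of the original answer. Note that adjacent separators leave empty strings in the result, so the last step filters them out.

```python
def split_string_on_multiple_separators(input_string, separators):
    # Same approach as above: split every piece on each separator in turn
    buffer = [input_string]
    for sep in separators:
        pieces = buffer
        buffer = []
        for s in pieces:
            buffer.extend(s.split(sep))
    return buffer

# Separators may be longer than one character (illustrative values)
parts = split_string_on_multiple_separators("a, b -- c", [", ", " -- "])
print(parts)

# Adjacent separators leave empty strings behind; drop them if unwanted
raw = split_string_on_multiple_separators("Hey, you - what", [",", " ", "-"])
words = [w for w in raw if w]
print(words)
```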

2019-05-23 17:03:55

Other answers

I ran into the same problem as @ooboo and found this topic. @ghostdog74 inspired me; maybe someone will find my solution useful:

str1='adj:sg:nom:m1.m2.m3:pos'
splitat=':.'
''.join([ s if s not in splitat else ' ' for s in str1]).split()

If you don't want to split at spaces, put some other character in place of the space and split on that same character.
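
A sketch of that variation, assuming a placeholder character (here '\x00', my own choice) that never occurs in the input. The separators are replaced by the placeholder and the string is then split on it, so spaces inside a field survive. The sample string is illustrative.

```python
str1 = 'first name:last name.middle name'
splitat = ':.'
placeholder = '\x00'  # assumption: this character is absent from the input

# Replace every separator character with the placeholder
replaced = ''.join(placeholder if ch in splitat else ch for ch in str1)

# split(placeholder) keeps empty strings (unlike split()), so filter them out
fields = [f for f in replaced.split(placeholder) if f]
print(fields)
```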

2011-03-15 10:12:20

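First of all, I don't think your intention is to actually use punctuation as delimiters in the split function. Your description suggests that you simply want to eliminate punctuation from the resulting strings.

I come across this pretty frequently, and my usual solution doesn't require re.

One-liner lambda function with a list comprehension:

(requires import string):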

split_without_punc = lambda text : [word.strip(string.punctuation) for word in 
    text.split() if word.strip(string.punctuation) != '']

# Call function
split_without_punc("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

Function (traditional)

As a traditional function, this is still only two lines with a list comprehension (in addition to import string):

import string

def split_without_punctuation2(text):

    # Split by whitespace
    words = text.split()

    # Strip punctuation from each word and drop words that become empty
    return [word.strip(string.punctuation) for word in words
            if word.strip(string.punctuation) != '']

split_without_punctuation2("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

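It will also naturally leave contractions and hyphenated words intact. You can always use text.replace("-", " ") to turn hyphens into spaces before the split.

General function without lambda or list comprehension

For a more general solution (where you can specify the characters to eliminate), and without a list comprehension, you get: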

def split_without(text: str, ignore: str) -> list:

    # Split by whitespace
    split_string = text.split()

    # Strip any characters in the ignore string, and ignore empty strings
    words = []
    for word in split_string:
        word = word.strip(ignore)
        if word != '':
            words.append(word)

    return words

# Situation-specific call to general function
import string
final_text = split_without("Hey, you - what are you doing?!", string.punctuation)
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

Of course, you can always generalize the lambda function to any specified string of characters as well.
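
For example, a generalized lambda might look like the sketch below. The name split_without_chars and its default argument are hypothetical, chosen here only for illustration.

```python
import string

# Hypothetical generalization: 'ignore' defaults to ASCII punctuation
split_without_chars = lambda text, ignore=string.punctuation: [
    word.strip(ignore) for word in text.split() if word.strip(ignore) != ''
]

default_words = split_without_chars("Hey, you -- what are you doing?!")
custom_words = split_without_chars("*bold* _under_ plain", "*_")
print(default_words)
print(custom_words)
```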

2014-11-04 19:17:37

Another way to achieve this is to use the Natural Language Toolkit (nltk).

import nltk
data= "Hey, you - what are you doing here!?"
word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
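print(word_tokens)

This prints: ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

The main drawback of this method is that you need to install the nltk package.

The benefit is that you can do a lot of fun stuff with the rest of the nltk package once you have your tokens.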
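
If installing nltk is not an option, the same \w+ pattern can be applied with the standard re module. This is only a sketch for this simple case, not a replacement for nltk's tokenizers; as far as I know, regexp_tokenize with this pattern returns the same matches as re.findall. Be aware that \w+ also splits contractions such as "don't" into two tokens.

```python
import re

data = "Hey, you - what are you doing here!?"

# Collect every maximal run of word characters (letters, digits, underscore)
word_tokens = re.findall(r'\w+', data)
print(word_tokens)
```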

2009-06-29 18:51:37

Here is my take on a split with multiple delimiters:

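def msplit(text, delims):
    # Generator: collect characters into a word until a delimiter is seen
    w = ''
    for z in text:
        if z not in delims:
            w += z
        else:
            if len(w) > 0:
                yield w
            w = ''
    # Emit the last word if the text does not end with a delimiter
    if len(w) > 0:
        yield w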
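
Since msplit is a generator, the result has to be consumed, for instance with list(). A minimal usage sketch follows; the delimiter string is my own choice for the sentence from the question.

```python
def msplit(text, delims):
    # Generator: accumulate characters until a delimiter is seen
    word = ''
    for ch in text:
        if ch not in delims:
            word += ch
        else:
            if word:
                yield word
            word = ''
    if word:
        yield word

# delims is a string; 'ch not in delims' tests each character against it
words = list(msplit("Hey, you - what are you doing here!?", ", -!?"))
print(words)
```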

2011-08-06 11:38:15

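So many answers, yet I can't find any solution that efficiently does what the title of the question literally asks for: splitting on multiple possible separators. Many answers instead split on anything that is not a word character, which is different. So here is an answer to the question in the title, relying on Python's standard and efficient re module: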

>>> import re  # Will be splitting on: , <space> - ! ? :
>>> list(filter(None, re.split(r"[, \-!?:]+", "Hey, you - what are you doing here!?")))
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

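where:

[…] matches one of the separators listed inside; the \- in the regular expression prevents the special interpretation of - as a character range indicator (as in A-Z); the + skips one or more delimiters (it could be omitted thanks to filter(), but that would unnecessarily produce empty strings between adjacent single-character separators); and filter(None, …) removes the empty strings that may be created by leading and trailing separators (since empty strings have a false boolean value).

This re.split() precisely "splits with multiple separators", as asked for in the question title.

Furthermore, this solution is immune to the problems with non-ASCII characters in words that affect some other solutions (see the first comment on ghostdog74's answer).

The re module is much more efficient (in speed and concision) than doing Python loops and tests "by hand"!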
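
When the separators come from a variable rather than being typed directly into the pattern, the regular expression can be assembled with re.escape. This is a minimal sketch under that assumption; the helper name split_on_any is hypothetical. Lowercasing the result afterwards gives the list shown in the question.

```python
import re

def split_on_any(text, separators):
    # Escape each separator so characters such as '-' or '?' are literal.
    # Longer separators go first so they win over their own prefixes.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "(?:" + "|".join(re.escape(s) for s in ordered) + ")+"
    # re.split leaves empty strings at the edges; drop them
    return [piece for piece in re.split(pattern, text) if piece]

words = split_on_any("Hey, you - what are you doing here!?",
                     [",", " ", "-", "!", "?", ":"])
lowered = [w.lower() for w in words]
print(words)
print(lowered)
```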

2014-05-18 09:43:54
