Split a string into words using multiple word-boundary delimiters

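I think what I want to do is a fairly common task, but I couldn't find any reference for it online. I have text with punctuation, and I want a list of the words.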

"Hey, you - what are you doing here!?"

should become

['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

But Python's str.split() only works with one argument, so after splitting on whitespace all the words still have punctuation attached. Any ideas?

Current answer

Using maketrans and translate, you can do it easily and neatly:

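specials = ',.!?:;"()<>[]#$=-/'
# Python 3: maketrans is a static method of str (string.maketrans was the Python 2 spelling)
trans = str.maketrans(specials, ' ' * len(specials))
# body is the input text; a sample value is used here
body = "Hey, you - what are you doing here!?"
# Replace every special character with a space, then split on whitespace
body = body.translate(trans)
words = body.strip().split()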
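
The same idea can be wrapped in a small reusable function. This is only a sketch: the name split_words and the default of string.punctuation are assumptions for illustration, not part of the answer above.

```python
import string

def split_words(text, specials=string.punctuation):
    """Hypothetical helper: turn every character in `specials` into a space, then split."""
    # str.maketrans builds a table that maps each special character to a space
    table = str.maketrans(specials, ' ' * len(specials))
    # split() with no argument collapses runs of whitespace and drops empty strings
    return text.translate(table).split()

print(split_words("Hey, you - what are you doing here!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Note that with this default the apostrophe is also treated as a delimiter, so a contraction such as "don't" would come out as two words.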

2018-03-03 23:59:23

Other answers

Another way to do this is with the Natural Language Toolkit (nltk).

import nltk
data= "Hey, you - what are you doing here!?"
word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
print(word_tokens)

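This prints: ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

The biggest downside of this approach is that you need to install the nltk package.

The upside is that once you have the tokens, you can do a lot of fun things with the rest of the nltk package.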

2009-06-29 18:51:37

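First, I don't think your intent is to actually use punctuation as delimiters in the split function. Your description suggests that you simply want to remove punctuation from the resulting strings.

I run into this quite often, and my usual solution doesn't require re.

One-line lambda function with a list comprehension:

(requires import string):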

split_without_punc = lambda text : [word.strip(string.punctuation) for word in 
    text.split() if word.strip(string.punctuation) != '']

# Call function
split_without_punc("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

Function (traditional)

As a traditional function, this is still only two lines with a list comprehension (in addition to import string):

def split_without_punctuation2(text):

    # Split by whitespace
    words = text.split()

    # Strip punctuation from each word
    return [word.strip(string.punctuation) for word in words if word.strip(string.punctuation) != '']

split_without_punctuation2("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

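It will also naturally leave contractions and hyphenated words intact. You can always use text.replace("-", " ") to turn hyphens into spaces before the split.

General function without lambda or list comprehension

For a more general solution (where you can specify the characters to strip), and without a list comprehension, you get: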
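
To make the point about contractions and hyphens concrete, here is a small sketch. The sample sentence is made up for illustration, and the function is a restatement of the one above rather than a new technique.

```python
import string

def split_without_punc(text):
    # Strip leading/trailing punctuation from each whitespace-separated token
    stripped = (word.strip(string.punctuation) for word in text.split())
    # Drop tokens that were nothing but punctuation
    return [word for word in stripped if word]

sample = "Don't touch the well-known switch!"

# Punctuation inside a word is untouched, so contractions and hyphenated words survive
print(split_without_punc(sample))
# ["Don't", 'touch', 'the', 'well-known', 'switch']

# Converting hyphens to spaces first splits the hyphenated word as well
print(split_without_punc(sample.replace("-", " ")))
# ["Don't", 'touch', 'the', 'well', 'known', 'switch']
```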

def split_without(text: str, ignore: str) -> list:

    # Split by whitespace
    split_string = text.split()

    # Strip any characters in the ignore string, and ignore empty strings
    words = []
    for word in split_string:
        word = word.strip(ignore)
        if word != '':
            words.append(word)

    return words

# Situation-specific call to general function
import string
final_text = split_without("Hey, you - what are you doing?!", string.punctuation)
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

Of course, you can also generalize the lambda function to any specified string of characters.
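
One way that generalization might look, as a sketch: the name split_without_chars and the default argument are assumptions chosen for illustration.

```python
import string

# Hypothetical generalized lambda: `ignore` is the set of characters to strip,
# defaulting to string.punctuation
split_without_chars = lambda text, ignore=string.punctuation: [
    word.strip(ignore) for word in text.split() if word.strip(ignore) != ''
]

print(split_without_chars("Hey, you -- what are you doing?!"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing']

# Strip only hyphens and commas
print(split_without_chars("--verbose, --quiet", "-,"))
# ['verbose', 'quiet']
```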

2014-11-04 19:17:37

I like re, but here is my solution without it:

from itertools import groupby
sep = ' ,-!?'
s = "Hey, you - what are you doing here!?"
print([''.join(g) for k, g in groupby(s, sep.__contains__) if not k])

sep.__contains__ is the method used by the 'in' operator. It is basically the same as

lambda ch: ch in sep

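but is more convenient here.

groupby takes our string and a function. It splits the string into groups using that function: whenever the function's value changes, a new group is started. So sep.__contains__ is exactly what we need.

groupby returns a sequence of pairs, where pair[0] is the result of our function and pair[1] is a group. With 'if not k' we filter out the groups of separators (because sep.__contains__ returns True for separators). That's all: we now have a sequence of groups, each of which is a word (a group is actually an iterable, so we use join to convert it to a string).

This solution is quite general, because it uses a function to separate the string (you can split on any condition you need). Also, it doesn't create intermediate strings or lists (you can remove join and the expression becomes lazy, since each group is an iterator).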
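
A minimal sketch of that generality, assuming a hypothetical helper named iter_words that accepts any predicate and yields words lazily:

```python
from itertools import groupby

def iter_words(text, is_separator):
    """Hypothetical helper: lazily yield each run of non-separator characters."""
    for is_sep, group in groupby(text, is_separator):
        if not is_sep:
            # Each group must be consumed before groupby advances, so join it here
            yield ''.join(group)

s = "Hey, you - what are you doing here!?"

# Same separator set as in the answer
print(list(iter_words(s, ' ,-!?'.__contains__)))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

# Any other condition works too, e.g. "everything that is not alphanumeric"
print(list(iter_words(s, lambda ch: not ch.isalnum())))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Because iter_words is a generator, no list of words is built until the caller asks for one.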

2013-10-06 17:30:05

Here is an answer with some explanation.

st = "Hey, you - what are you doing here!?"

# replace all the non alpha-numeric with space and then join.
new_string = ''.join([x.replace(x, ' ') if not x.isalnum() else x for x in st])
# output of new_string
'Hey  you   what are you doing here  '

# str.split() will remove all the empty string if separator is not provided
new_list = new_string.split()

# output of new_list
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

# we can join it to get a complete string without any non alpha-numeric character
' '.join(new_list)
# output
'Hey you what are you doing here'

Or in one line, we can do this:

(''.join([x.replace(x, ' ') if not x.isalnum() else x for x in st])).split()

# output
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

Updated answer

2016-06-04 19:35:58

Here is my take on it...

def split_string(source,splitlist):
    splits = frozenset(splitlist)
    l = []
    s1 = ""
    for c in source:
        if c in splits:
            if s1:
                l.append(s1)
                s1 = ""
        else:
            s1 = s1 + c
    if s1:
        l.append(s1)
    return l

>>> out = split_string("First Name,Last Name,Street Address,City,State,Zip Code", ",")
>>> print(out)
['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']

2013-04-29 05:32:04
