使用多个单词边界分隔符将字符串拆分为单词

我想我想做的是一项相当常见的任务，但我在网上找不到任何参考资料。我有带标点符号的文本，我想要一个单词列表。

"Hey, you - what are you doing here!?"

应该是

['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

但Python的str.split（）只对一个参数有效，所以在用空格拆分后，所有单词都带有标点符号。有什么想法吗？

当前回答

我最近需要这样做，但需要一个与标准库str.split函数有点匹配的函数，当使用0或1参数调用时，该函数的行为与标准库相同。

def split_many(string, *separators):
    if len(separators) == 0:
        return string.split()
    if len(separators) > 1:
        table = {
            ord(separator): ord(separator[0])
            for separator in separators
        }
        string = string.translate(table)
    return string.split(separators[0])

注意：此函数仅在分隔符由单个字符组成时有用（正如我的用例）。

2019-05-17 08:22:13

其他回答

另一种不使用正则表达式的快速方法是首先替换字符，如下所示：

>>> 'a;bcd,ef g'.replace(';',' ').replace(',',' ').split()
['a', 'bcd', 'ef', 'g']

2011-08-27 16:10:52

这是一个有一些解释的答案。

st = "Hey, you - what are you doing here!?"

# replace all the non alpha-numeric with space and then join.
new_string = ''.join([x.replace(x, ' ') if not x.isalnum() else x for x in st])
# output of new_string
'Hey  you  what are you doing here  '

# str.split() will remove all the empty string if separator is not provided
new_list = new_string.split()

# output of new_list
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

# we can join it to get a complete string without any non alpha-numeric character
' '.join(new_list)
# output
'Hey you what are you doing'

或者在一行中，我们可以这样做：

(''.join([x.replace(x, ' ') if not x.isalnum() else x for x in st])).split()

# output
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

更新的答案

2016-06-04 19:35:58

首先，在循环中执行任何RegEx操作之前，请始终使用re.compile（），因为它的工作速度比正常操作快。

因此，对于您的问题，首先编译模式，然后对其执行操作。

import re
DATA = "Hey, you - what are you doing here!?"
reg_tok = re.compile("[\w']+")
print reg_tok.findall(DATA)

2015-06-02 07:06:45

这是我与多个决策者的分歧：

def msplit( str, delims ):
  w = ''
  for z in str:
    if z not in delims:
        w += z
    else:
        if len(w) > 0 :
            yield w
        w = ''
  if len(w) > 0 :
    yield w

2011-08-06 11:38:15

我认为以下是满足您需求的最佳答案：

\W+可能适合这种情况，但可能不适合其他情况。

filter(None, re.compile('[ |,|\-|!|?]').split( "Hey, you - what are you doing here!?")

2012-03-09 08:30:11

使用多个单词边界分隔符将字符串拆分为单词

推荐文章

最新文章

标签