我想我想做的是一项相当常见的任务,但我在网上找不到任何参考资料。我有带标点符号的文本,我想要一个单词列表。

"Hey, you - what are you doing here!?"

应该是

['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

但Python的str.split()只对一个参数有效,所以在用空格拆分后,所有单词都带有标点符号。有什么想法吗?


当前回答

我认为以下是满足您需求的最佳答案:

\W+可能适合这种情况,但可能不适合其他情况。

filter(None, re.compile('[ |,|\-|!|?]').split( "Hey, you - what are you doing here!?")

其他回答

遇到与@ooboo相同的问题并找到此主题@ghostdog74启发了我,也许有人觉得我的解决方案很有用

str1='adj:sg:nom:m1.m2.m3:pos'
splitat=':.'
''.join([ s if s not in splitat else ' ' for s in str1]).split()

在空格处输入内容,如果不想在空格处拆分,请使用相同的字符进行拆分。

我必须想出自己的解决方案,因为我迄今为止测试的所有东西都在某一点上失败了。

>>> import re
>>> def split_words(text):
...     rgx = re.compile(r"((?:(?<!'|\w)(?:\w-?'?)+(?<!-))|(?:(?<='|\w)(?:\w-?'?)+(?=')))")
...     return rgx.findall(text)

至少在下面的例子中,它似乎工作得很好。

>>> split_words("The hill-tops gleam in morning's spring.")
['The', 'hill-tops', 'gleam', 'in', "morning's", 'spring']
>>> split_words("I'd say it's James' 'time'.")
["I'd", 'say', "it's", "James'", 'time']
>>> split_words("tic-tac-toe's tic-tac-toe'll tic-tac'tic-tac we'll--if tic-tac")
["tic-tac-toe's", "tic-tac-toe'll", "tic-tac'tic-tac", "we'll", 'if', 'tic-tac']
>>> split_words("google.com email@google.com split_words")
['google', 'com', 'email', 'google', 'com', 'split_words']
>>> split_words("Kurt Friedrich Gödel (/ˈɡɜːrdəl/;[2] German: [ˈkʊɐ̯t ˈɡøːdl̩] (listen);")
['Kurt', 'Friedrich', 'Gödel', 'ˈɡɜːrdəl', '2', 'German', 'ˈkʊɐ', 't', 'ˈɡøːdl', 'listen']
>>> split_words("April 28, 1906 – January 14, 1978) was an Austro-Hungarian-born Austrian...")
['April', '28', '1906', 'January', '14', '1978', 'was', 'an', 'Austro-Hungarian-born', 'Austrian']

另一种方式,不使用正则表达式

import string
punc = string.punctuation
thestring = "Hey, you - what are you doing here!?"
s = list(thestring)
''.join([o for o in s if not o in punc]).split()

如果需要可逆操作(保留分隔符),可以使用此函数:

def tokenizeSentence_Reversible(sentence):
    setOfDelimiters = ['.', ' ', ',', '*', ';', '!']
    listOfTokens = [sentence]

    for delimiter in setOfDelimiters:
        newListOfTokens = []
        for ind, token in enumerate(listOfTokens):
            ll = [([delimiter, w] if ind > 0 else [w]) for ind, w in enumerate(token.split(delimiter))]
            listOfTokens = [item for sublist in ll for item in sublist] # flattens.
            listOfTokens = filter(None, listOfTokens) # Removes empty tokens: ''
            newListOfTokens.extend(listOfTokens)

        listOfTokens = newListOfTokens

    return listOfTokens

这是我的看法。。。。

def split_string(source,splitlist):
    splits = frozenset(splitlist)
    l = []
    s1 = ""
    for c in source:
        if c in splits:
            if s1:
                l.append(s1)
                s1 = ""
        else:
            print s1
            s1 = s1 + c
    if s1:
        l.append(s1)
    return l

>>>out = split_string("First Name,Last Name,Street Address,City,State,Zip Code",",")
>>>print out
>>>['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']