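I think what I want to do is a fairly common task, but I couldn't find any reference for it on the web. I have text with punctuation, and I want a list of the words.
"Hey, you - what are you doing here!?"
should become
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
But Python's str.split() only works with a single separator argument, so after splitting on whitespace every word still has its punctuation attached. Any ideas?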
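Current answer
I think the following is the best answer to suit your needs:
\W+ may be suitable for this case, but may not be suitable for other cases.
import re
list(filter(None, re.compile(r'[ ,\-!?]').split("Hey, you - what are you doing here!?")))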
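For the sentence in the question, the \W+ idea mentioned above can be sketched as follows. This is only a minimal sketch: it assumes that a "word" is a run of \w characters (letters, digits, underscore) and that lowercase output is wanted, as in the question. The variable names are illustrative.

```python
import re

text = "Hey, you - what are you doing here!?"

# Collect runs of word characters instead of splitting on non-word
# characters, so no empty strings appear at the edges of the text.
words = re.findall(r"\w+", text.lower())
print(words)  # ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

re.split(r"\W+", text) finds the same words here but also returns a trailing empty string, because the text ends with punctuation.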
Other answers
I ran into the same issue as @ooboo and found this topic. @ghostdog74 inspired me, and maybe someone will find my solution useful.
str1='adj:sg:nom:m1.m2.m3:pos'
splitat=':.'
''.join([ s if s not in splitat else ' ' for s in str1]).split()
If you don't want to split at spaces, put some other character in place of the space and split on that same character.
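The variant described in the last sentence might look like the sketch below. The helper name split_on_chars and its placeholder parameter are hypothetical; the sketch assumes the placeholder character never occurs in the input.

```python
def split_on_chars(text, delimiters, placeholder="\x00"):
    """Split text at any character in delimiters, keeping spaces inside tokens.

    placeholder is an assumption: any character known not to occur in text.
    """
    replaced = "".join(placeholder if ch in delimiters else ch for ch in text)
    # Drop the empty pieces produced by adjacent delimiters.
    return [piece for piece in replaced.split(placeholder) if piece]


print(split_on_chars("adj one:sg two.nom", ":."))
# ['adj one', 'sg two', 'nom']
```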
I had to come up with my own solution, since everything I've tested so far failed at some point.
>>> import re
>>> def split_words(text):
...     rgx = re.compile(r"((?:(?<!'|\w)(?:\w-?'?)+(?<!-))|(?:(?<='|\w)(?:\w-?'?)+(?=')))")
...     return rgx.findall(text)
It seems to work fine, at least for the examples below.
>>> split_words("The hill-tops gleam in morning's spring.")
['The', 'hill-tops', 'gleam', 'in', "morning's", 'spring']
>>> split_words("I'd say it's James' 'time'.")
["I'd", 'say', "it's", "James'", 'time']
>>> split_words("tic-tac-toe's tic-tac-toe'll tic-tac'tic-tac we'll--if tic-tac")
["tic-tac-toe's", "tic-tac-toe'll", "tic-tac'tic-tac", "we'll", 'if', 'tic-tac']
>>> split_words("google.com email@google.com split_words")
['google', 'com', 'email', 'google', 'com', 'split_words']
>>> split_words("Kurt Friedrich Gödel (/ˈɡɜːrdəl/;[2] German: [ˈkʊɐ̯t ˈɡøːdl̩] (listen);")
['Kurt', 'Friedrich', 'Gödel', 'ˈɡɜːrdəl', '2', 'German', 'ˈkʊɐ', 't', 'ˈɡøːdl', 'listen']
>>> split_words("April 28, 1906 – January 14, 1978) was an Austro-Hungarian-born Austrian...")
['April', '28', '1906', 'January', '14', '1978', 'was', 'an', 'Austro-Hungarian-born', 'Austrian']
Another way, without regular expressions:
import string
punc = string.punctuation
thestring = "Hey, you - what are you doing here!?"
s = list(thestring)
''.join([o for o in s if not o in punc]).split()
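One thing to keep in mind: deleting punctuation joins the pieces of a hyphenated word ("hill-tops" becomes "hilltops"). If such words should be split instead, a possible variation is to replace each punctuation character with a space before splitting. The sketch below uses str.maketrans and str.translate; it assumes ASCII punctuation only (string.punctuation does not contain characters such as the en dash) and lowercases the result to match the question.

```python
import string

# Map every ASCII punctuation character to a space.
to_spaces = str.maketrans(string.punctuation, " " * len(string.punctuation))

thestring = "Hey, you - what are you doing here!?"
words = thestring.translate(to_spaces).lower().split()
print(words)  # ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```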
If you need a reversible operation (one that keeps the delimiters), you can use this function:
def tokenizeSentence_Reversible(sentence):
    setOfDelimiters = ['.', ' ', ',', '*', ';', '!']
    listOfTokens = [sentence]
    for delimiter in setOfDelimiters:
        newListOfTokens = []
        for token in listOfTokens:
            # Put the delimiter back in front of every piece except the first.
            ll = [([delimiter, w] if ind > 0 else [w]) for ind, w in enumerate(token.split(delimiter))]
            pieces = [item for sublist in ll for item in sublist]  # flattens.
            pieces = filter(None, pieces)  # Removes empty tokens: ''
            newListOfTokens.extend(pieces)
        listOfTokens = newListOfTokens
    return listOfTokens
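A shorter way to get a similar reversible result is re.split() with a capturing group, which returns the delimiters along with the text between them. This is a different technique from the loop above, shown only as a sketch. The name tokenize_reversible and the delimiters parameter are illustrative, with the default set copied from the function above.

```python
import re


def tokenize_reversible(sentence, delimiters=". ,*;!"):
    # A capturing group makes re.split() return the delimiters as well.
    pattern = "([" + re.escape(delimiters) + "])"
    # Drop the empty strings that appear between adjacent delimiters.
    return [tok for tok in re.split(pattern, sentence) if tok]


sentence = "Hey, you; stop!"
tokens = tokenize_reversible(sentence)
print(tokens)  # ['Hey', ',', ' ', 'you', ';', ' ', 'stop', '!']

# Joining the tokens restores the original text.
assert "".join(tokens) == sentence
```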
Here is my take on it...
def split_string(source, splitlist):
    splits = frozenset(splitlist)
    l = []
    s1 = ""
    for c in source:
        if c in splits:
            # A delimiter ends the current token, if there is one.
            if s1:
                l.append(s1)
                s1 = ""
        else:
            s1 = s1 + c
    # Keep the final token when the text does not end with a delimiter.
    if s1:
        l.append(s1)
    return l
>>> out = split_string("First Name,Last Name,Street Address,City,State,Zip Code", ",")
>>> print(out)
['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']