我想我想做的是一项相当常见的任务,但我在网上找不到任何参考资料。我有带标点符号的文本,我想要一个单词列表。

"Hey, you - what are you doing here!?"

应该是

['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

但Python的str.split()只对一个参数有效,所以在用空格拆分后,所有单词都带有标点符号。有什么想法吗?


当前回答

这是我的看法。。。。

def split_string(source,splitlist):
    splits = frozenset(splitlist)
    l = []
    s1 = ""
    for c in source:
        if c in splits:
            if s1:
                l.append(s1)
                s1 = ""
        else:
            print s1
            s1 = s1 + c
    if s1:
        l.append(s1)
    return l

>>>out = split_string("First Name,Last Name,Street Address,City,State,Zip Code",",")
>>>print out
>>>['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']

其他回答

创建一个函数,将两个字符串(要拆分的源字符串和分隔符的拆分列表字符串)作为输入,并输出拆分单词列表:

def split_string(source, splitlist):
    output = []  # output list of cleaned words
    atsplit = True
    for char in source:
        if char in splitlist:
            atsplit = True
        else:
            if atsplit:
                output.append(char)  # append new word after split
                atsplit = False
            else: 
                output[-1] = output[-1] + char  # continue copying characters until next split
    return output

实现这一点的另一种方法是使用自然语言工具包(nltk)。

import nltk
data= "Hey, you - what are you doing here!?"
word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
print word_tokens

这张照片显示:[“嘿”、“你”、“什么”、“是”、“您”、“正在做”、“在这里”]

这种方法的最大缺点是需要安装nltk包。

好处是,一旦获得令牌,就可以使用nltk包的其余部分做很多有趣的事情。

正则表达式对正的情况:

import re
DATA = "Hey, you - what are you doing here!?"
print re.findall(r"[\w']+", DATA)
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

另一种方式,不使用正则表达式

import string
punc = string.punctuation
thestring = "Hey, you - what are you doing here!?"
s = list(thestring)
''.join([o for o in s if not o in punc]).split()
def get_words(s):
    l = []
    w = ''
    for c in s.lower():
        if c in '-!?,. ':
            if w != '': 
                l.append(w)
            w = ''
        else:
            w = w + c
    if w != '': 
        l.append(w)
    return l

用法如下:

>>> s = "Hey, you - what are you doing here!?"
>>> print get_words(s)
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']