似乎应该有一种比以下更简单的方法:
import string
s = "string. With. Punctuation?" # Sample string
out = s.translate(string.maketrans("",""), string.punctuation)
有?
似乎应该有一种比以下更简单的方法:
import string
s = "string. With. Punctuation?" # Sample string
out = s.translate(string.maketrans("",""), string.punctuation)
有?
当前回答
使用Python从文本文件中删除停止词
print('====THIS IS HOW TO REMOVE STOP WORS====')
with open('one.txt','r')as myFile:
str1=myFile.read()
stop_words ="not", "is", "it", "By","between","This","By","A","when","And","up","Then","was","by","It","If","can","an","he","This","or","And","a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though","be","But","these"
myList=[]
myList.extend(str1.split(" "))
for i in myList:
if i not in stop_words:
print ("____________")
print(i,end='\n')
其他回答
从效率的角度来看,你不会击败
s.translate(None, string.punctuation)
对于更高版本的Python,请使用以下代码:
s.translate(str.maketrans('', '', string.punctuation))
它使用查找表在C语言中执行原始字符串操作——除了编写自己的C代码之外,没有什么能比这更好的了。
如果速度不令人担忧,另一个选择是:
exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)
这比用每个字符替换s.replace更快,但不会像正则表达式或字符串转换等非纯python方法那样执行得好,正如您从下面的计时中看到的那样。对于这种类型的问题,在尽可能低的水平上解决是有回报的。
计时代码:
import re, string, timeit
s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("","")
regex = re.compile('[%s]' % re.escape(string.punctuation))
def test_set(s):
return ''.join(ch for ch in s if ch not in exclude)
def test_re(s): # From Vinko's solution, with fix.
return regex.sub('', s)
def test_trans(s):
return s.translate(table, string.punctuation)
def test_repl(s): # From S.Lott's solution
for c in string.punctuation:
s=s.replace(c,"")
return s
print "sets :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)
结果如下:
sets : 19.8566138744
regex : 6.86155414581
translate : 2.12455511093
replace : 28.4436721802
这里有一个使用RegEx的简单方法
import re
punct = re.compile(r'(\w+)')
sentence = 'This ! is : a # sample $ sentence.' # Text with punctuation
tokenized = [m.group() for m in punct.finditer(sentence)]
sentence = ' '.join(tokenized)
print(sentence)
'This is a sample sentence'
这里有一个没有正则表达式的解决方案。
import string
input_text = "!where??and!!or$$then:)"
punctuation_replacer = string.maketrans(string.punctuation, ' '*len(string.punctuation))
print ' '.join(input_text.translate(punctuation_replacer).split()).strip()
Output>> where and or then
用空格替换标点用单个空格替换单词之间的多个空格删除尾随空格(如果有)条带()
在不太严格的情况下,单行线可能会有所帮助:
''.join([c for c in s if c.isalnum() or c.isspace()])
这个问题没有太多细节,所以我采取的方法是用最简单的解释来解决这个问题:去掉标点符号。
请注意,所提供的解决方案不考虑收缩词(例如,you are)或连字符词(例如肛门保留)。。。这是关于它们是否应该被视为标点符号的争论。。。也不能解释非英语字符集或类似的东西。。。因为问题中没有提到这些细节。有人认为空格是标点符号,这在技术上是正确的。。。但对我来说,这在当前问题的背景下毫无意义。
# using lambda
''.join(filter(lambda c: c not in string.punctuation, s))
# using list comprehension
''.join('' if c in string.punctuation else c for c in s)