似乎应该有一种比以下更简单的方法:
import string
s = "string. With. Punctuation?" # Sample string
out = s.translate(string.maketrans("",""), string.punctuation)
有?
似乎应该有一种比以下更简单的方法:
import string
s = "string. With. Punctuation?" # Sample string
out = s.translate(string.maketrans("",""), string.punctuation)
有?
当前回答
我还没有看到这个答案。只需使用正则表达式;它删除了除单词字符(\w)和数字字符(\d)之外的所有字符,后跟一个空白字符(\s):
import re
s = "string. With. Punctuation?" # Sample string
out = re.sub(ur'[^\w\d\s]+', '', s)
其他回答
这可能不是最好的解决方案,但我就是这样做的。
import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])
这里有一个没有正则表达式的解决方案。
import string
input_text = "!where??and!!or$$then:)"
punctuation_replacer = string.maketrans(string.punctuation, ' '*len(string.punctuation))
print ' '.join(input_text.translate(punctuation_replacer).split()).strip()
Output>> where and or then
用空格替换标点用单个空格替换单词之间的多个空格删除尾随空格(如果有)条带()
作为更新,我重写了Python 3中的@Brian示例,并对其进行了更改,以将正则表达式编译步骤移到函数内部。我在这里的想法是对使功能工作所需的每一步进行计时。也许您使用的是分布式计算,无法在工作人员之间共享regex对象,需要在每个工作人员处执行re.compile步骤。此外,我还很好奇地对Python 3的maketrans的两种不同实现进行计时
table = str.maketrans({key: None for key in string.punctuation})
vs
table = str.maketrans('', '', string.punctuation)
另外,我添加了另一种使用集合的方法,在这里我利用交集函数来减少迭代次数。
这是完整的代码:
import re, string, timeit
s = "string. With. Punctuation"
def test_set(s):
exclude = set(string.punctuation)
return ''.join(ch for ch in s if ch not in exclude)
def test_set2(s):
_punctuation = set(string.punctuation)
for punct in set(s).intersection(_punctuation):
s = s.replace(punct, ' ')
return ' '.join(s.split())
def test_re(s): # From Vinko's solution, with fix.
regex = re.compile('[%s]' % re.escape(string.punctuation))
return regex.sub('', s)
def test_trans(s):
table = str.maketrans({key: None for key in string.punctuation})
return s.translate(table)
def test_trans2(s):
table = str.maketrans('', '', string.punctuation)
return(s.translate(table))
def test_repl(s): # From S.Lott's solution
for c in string.punctuation:
s=s.replace(c,"")
return s
print("sets :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2 :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))
这是我的结果:
sets : 3.1830138750374317
sets2 : 2.189873124472797
regex : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace : 4.579746678471565
考虑unicode。代码已在python3中检查。
from unicodedata import category
text = 'hi, how are you?'
text_without_punc = ''.join(ch for ch in text if not category(ch).startswith('P'))
使用Python从文本文件中删除停止词
print('====THIS IS HOW TO REMOVE STOP WORS====')
with open('one.txt','r')as myFile:
str1=myFile.read()
stop_words ="not", "is", "it", "By","between","This","By","A","when","And","up","Then","was","by","It","If","can","an","he","This","or","And","a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though","be","But","these"
myList=[]
myList.extend(str1.split(" "))
for i in myList:
if i not in stop_words:
print ("____________")
print(i,end='\n')