It seems like there should be a simpler way than:
import string
s = "string. With. Punctuation?" # Sample string
out = s.translate(string.maketrans("",""), string.punctuation)
Is there?
Current answer
Here is a solution without regular expressions.
import string
input_text = "!where??and!!or$$then:)"
punctuation_replacer = string.maketrans(string.punctuation, ' '*len(string.punctuation))
print ' '.join(input_text.translate(punctuation_replacer).split()).strip()
Output>> where and or then
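- Replaces punctuation with spaces
- Replaces multiple spaces between words with a single space
- Removes trailing spaces, if any, with strip()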
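The snippet above is written for Python 2 (string.maketrans and the print statement). A rough Python 3 equivalent of the same steps might look like the sketch below; the function name replace_punctuation_with_spaces is made up for illustration and is not part of the original answer.

```python
import string

def replace_punctuation_with_spaces(text):
    """Replace each punctuation character with a space, then collapse whitespace."""
    # Map every character in string.punctuation to a single space.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    # split() with no arguments drops leading/trailing whitespace and
    # collapses runs of spaces, so no extra strip() is needed.
    return ' '.join(text.translate(table).split())

print(replace_punctuation_with_spaces("!where??and!!or$$then:)"))  # where and or then
```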
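Other answers
As an update, I rewrote @Brian's example in Python 3 and changed it to move the regex compile step inside the function. My thought here was to time every single step needed to make the function work. Perhaps you are using distributed computing and cannot share a regex object between your workers, so each worker needs to run the re.compile step itself. I was also curious to time two different implementations of maketrans for Python 3: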
table = str.maketrans({key: None for key in string.punctuation})
vs
table = str.maketrans('', '', string.punctuation)
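As a quick sanity check, the two calls should build the same mapping (code point to None), so any timing difference comes from how the table is constructed rather than from what it does. A minimal sketch; the names table_from_dict and table_from_args are illustrative only:

```python
import string

# Dict form: single-character keys are converted to code points by str.maketrans.
table_from_dict = str.maketrans({key: None for key in string.punctuation})
# Three-argument form: every character in the third argument maps to None.
table_from_args = str.maketrans('', '', string.punctuation)

print(table_from_dict == table_from_args)
print("a.b,c!".translate(table_from_dict), "a.b,c!".translate(table_from_args))
```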
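Plus, I added another method that uses a set, where I take advantage of the intersection function to reduce the number of iterations.
This is the complete code: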
import re, string, timeit

s = "string. With. Punctuation"

def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)

def test_set2(s):
    _punctuation = set(string.punctuation)
    # Only loop over the punctuation characters that actually occur in s
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())

def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)

def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)

def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return s.translate(table)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s

print("sets :", timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2 :", timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex :", timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :", timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :", timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace :", timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))
Here are my results:
sets : 3.1830138750374317
sets2 : 2.189873124472797
regex : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace : 4.579746678471565
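These timings deliberately include building the table or compiling the regex on every call. When the table can be shared, a common variation is to build it once at module level and reuse it. The sketch below assumes that is possible in your setup; _PUNCT_TABLE and strip_punctuation are illustrative names, and no timing is claimed for this variant.

```python
import string

# Built once at import time and reused by every call (assumes the table can be shared).
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Return text with every character in string.punctuation removed."""
    return text.translate(_PUNCT_TABLE)

print(strip_punctuation("string. With. Punctuation?"))  # string With Punctuation
```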
>>> import re
>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]','',s)
>>> re.split(r'\s+', s)
['string', 'With', 'Punctuation']
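Two details of this approach are worth spelling out: \w also matches the underscore, so underscores survive the substitution, and str.split() with no arguments is a simple way to split on any run of whitespace. A small sketch, with words_without_punctuation as a made-up helper name:

```python
import re

def words_without_punctuation(text):
    """Drop everything that is not a word character or whitespace, then split."""
    cleaned = re.sub(r'[^\w\s]', '', text)
    # split() without arguments splits on runs of whitespace and ignores the ends.
    return cleaned.split()

print(words_without_punctuation("string. With. Punctuation?"))
# ['string', 'With', 'Punctuation']
print(words_without_punctuation("snake_case, ok!"))
# ['snake_case', 'ok']  (the underscore is kept because \w matches it)
```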
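When working with Unicode strings, I recommend using the PyPI regex module, because it supports both Unicode property classes (like \p{X} / \P{X}) and POSIX character classes (like [:name:]).
Just type pip install regex (or pip3 install regex) in your terminal and press Enter to install the package.
If you need to remove punctuation and symbols of any kind (that is, anything other than letters, digits, and whitespace), you can use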
regex.sub(r'[\p{P}\p{S}]', '', text) # to remove one by one
regex.sub(r'[\p{P}\p{S}]+', '', text) # to remove all consecutive punctuation/symbols with one go
regex.sub(r'[[:punct:]]+', '', text) # Same with a POSIX character class
See the Python demo:
import regex
text = 'भारत India <><>^$.,,! 002'
new_text = regex.sub(r'[\p{P}\p{S}\s]+', ' ', text).lower().strip()
# OR
# new_text = regex.sub(r'[[:punct:]\s]+', ' ', text).lower().strip()
print(new_text)
# => भारत india 002
Here, I added a whitespace pattern to the character class.
Try this one :)
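If installing a third-party package is not an option, a similar effect can be approximated with the standard library's unicodedata module by dropping characters whose general category starts with P (punctuation) or S (symbol). This is a different technique from the regex module shown above, not a drop-in replacement, and strip_unicode_punctuation is a hypothetical name.

```python
import unicodedata

def strip_unicode_punctuation(text):
    """Remove characters in Unicode categories P* and S*, then collapse whitespace."""
    kept = ''.join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ('P', 'S')
    )
    # Collapse the gaps left behind into single spaces.
    return ' '.join(kept.split())

print(strip_unicode_punctuation('भारत India <><>^$.,,! 002').lower())
```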
regex.sub(r'\p{P}','', s)
Here is a function I wrote. It's not very efficient, but it is simple, and you can add or remove any punctuation you want:
def stripPunc(wordList):
    """Strips punctuation from list of words"""
    puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
    for punc in puncList:
        # Remove this punctuation mark from every word in the list
        wordList = [word.replace(punc, '') for word in wordList]
    return wordList
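The same idea of a customizable punctuation list can also be sketched with a translation table, which makes a single pass over each word. The function name strip_punc_from_words and its default punctuation argument are assumptions for illustration, mirroring the characters listed above.

```python
def strip_punc_from_words(words, punctuation='.;:!?/\\,#@$&)("'):
    """Remove the given punctuation characters from every word in the list."""
    # Map each unwanted character to None so translate() deletes it.
    table = str.maketrans('', '', punctuation)
    return [word.translate(table) for word in words]

print(strip_punc_from_words(["hello,", "(world)!", "a/b"]))
# ['hello', 'world', 'ab']
```

Passing a different punctuation string, for example strip_punc_from_words(words, punctuation=".,"), changes which characters are removed.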