似乎应该有一种比以下更简单的方法:
import string
s = "string. With. Punctuation?" # Sample string
out = s.translate(string.maketrans("",""), string.punctuation)
有?
似乎应该有一种比以下更简单的方法:
import string
s = "string. With. Punctuation?" # Sample string
out = s.translate(string.maketrans("",""), string.punctuation)
有?
当前回答
这是我写的一个函数。它不是很有效,但很简单,您可以添加或删除任何您想要的标点符号:
def stripPunc(wordList):
"""Strips punctuation from list of words"""
puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
for punc in puncList:
for word in wordList:
wordList=[word.replace(punc,'') for word in wordList]
return wordList
其他回答
这可能不是最好的解决方案,但我就是这样做的。
import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])
这是我写的一个函数。它不是很有效,但很简单,您可以添加或删除任何您想要的标点符号:
def stripPunc(wordList):
"""Strips punctuation from list of words"""
puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
for punc in puncList:
for word in wordList:
wordList=[word.replace(punc,'') for word in wordList]
return wordList
字符串标点符号仅为ASCII!一种更正确(但也慢得多)的方法是使用unicodedata模块:
# -*- coding: utf-8 -*-
from unicodedata import category
s = u'String — with - «punctation »...'
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print 'stripped', s
您还可以概括和剥离其他类型的字符:
''.join(ch for ch in s if category(ch)[0] not in 'SP')
它还将删除~*+§$等字符,这些字符可能是“标点符号”,也可能不是“标点符号。
这个问题没有太多细节,所以我采取的方法是用最简单的解释来解决这个问题:去掉标点符号。
请注意,所提供的解决方案不考虑收缩词(例如,you are)或连字符词(例如肛门保留)。。。这是关于它们是否应该被视为标点符号的争论。。。也不能解释非英语字符集或类似的东西。。。因为问题中没有提到这些细节。有人认为空格是标点符号,这在技术上是正确的。。。但对我来说,这在当前问题的背景下毫无意义。
# using lambda
''.join(filter(lambda c: c not in string.punctuation, s))
# using list comprehension
''.join('' if c in string.punctuation else c for c in s)
作为更新,我重写了Python 3中的@Brian示例,并对其进行了更改,以将正则表达式编译步骤移到函数内部。我在这里的想法是对使功能工作所需的每一步进行计时。也许您使用的是分布式计算,无法在工作人员之间共享regex对象,需要在每个工作人员处执行re.compile步骤。此外,我还很好奇地对Python 3的maketrans的两种不同实现进行计时
table = str.maketrans({key: None for key in string.punctuation})
vs
table = str.maketrans('', '', string.punctuation)
另外,我添加了另一种使用集合的方法,在这里我利用交集函数来减少迭代次数。
这是完整的代码:
import re, string, timeit
s = "string. With. Punctuation"
def test_set(s):
exclude = set(string.punctuation)
return ''.join(ch for ch in s if ch not in exclude)
def test_set2(s):
_punctuation = set(string.punctuation)
for punct in set(s).intersection(_punctuation):
s = s.replace(punct, ' ')
return ' '.join(s.split())
def test_re(s): # From Vinko's solution, with fix.
regex = re.compile('[%s]' % re.escape(string.punctuation))
return regex.sub('', s)
def test_trans(s):
table = str.maketrans({key: None for key in string.punctuation})
return s.translate(table)
def test_trans2(s):
table = str.maketrans('', '', string.punctuation)
return(s.translate(table))
def test_repl(s): # From S.Lott's solution
for c in string.punctuation:
s=s.replace(c,"")
return s
print("sets :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2 :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))
这是我的结果:
sets : 3.1830138750374317
sets2 : 2.189873124472797
regex : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace : 4.579746678471565