It seems like there should be a simpler way than this:

import string
s = "string. With. Punctuation?" # Sample string 
out = s.translate(string.maketrans("",""), string.punctuation)

Is there?


Current answer

Try this one :)

import regex  # third-party package, not the standard re module
regex.sub(r'\p{P}', '', s)
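If the third-party regex package is unavailable, a rough standard-library equivalent can be sketched with unicodedata. This assumes "punctuation" means any character whose Unicode general category starts with P; the function name strip_unicode_punctuation is made up for illustration.

```python
import unicodedata


def strip_unicode_punctuation(text):
    """Drop every character whose Unicode general category starts with 'P'."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


s = "string. With. Punctuation?"
out = strip_unicode_punctuation(s)
```

Note that category P is not the same set as string.punctuation: symbols such as $ and + belong to category S, so they would be left in place by this sketch.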

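Other answers

The question does not have a lot of specifics, so the approach I took is to solve the problem under its simplest interpretation: just remove the punctuation.

Note that the solutions presented do not account for contractions (e.g., you're) or hyphenated words (e.g., anal-retentive), and whether those marks should be treated as punctuation is debatable. They also do not account for non-English character sets or anything like that, because those details were not mentioned in the question. Someone argued that the space is punctuation, which is technically correct, but to me it makes no sense in the context of the question at hand.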

# using lambda
''.join(filter(lambda c: c not in string.punctuation, s))

# using list comprehension
''.join('' if c in string.punctuation else c for c in s)
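To make the caveat about contractions and hyphenated words concrete, here is a small check of what filtering on string.punctuation does to them. The sample sentence is made up for illustration.

```python
import string

sample = "You're a well-known editor, aren't you?"

# Apostrophes and hyphens are in string.punctuation, so they are removed too.
cleaned = "".join(c for c in sample if c not in string.punctuation)
print(cleaned)  # Youre a wellknown editor arent you
```

The space character is not part of string.punctuation, so word boundaries survive.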

Not necessarily simpler, but a different way, if you are more familiar with the re family.

import re, string
s = "string. With. Punctuation?" # Sample string 
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)

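Apparently I can't supply edits to the selected answer, so here is an update that works for Python 3. The translate method is still the most efficient option when doing non-trivial transformations.

Credit for the original heavy lifting goes to @Brian above, and thanks to @ddejohn for the suggested improvement to the original test.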

#!/usr/bin/env python3

"""Determination of most efficient way to remove punctuation in Python 3.

Results in Python 3.8.10 on my system using the default arguments:

set       : 51.897
regex     : 17.901
translate :  2.059
replace   : 13.209
"""

import argparse
import re
import string
import timeit

parser = argparse.ArgumentParser()
parser.add_argument("--filename", "-f", default=argparse.__file__)
parser.add_argument("--iterations", "-i", type=int, default=10000)
opts = parser.parse_args()
with open(opts.filename) as fp:
    s = fp.read()
exclude = set(string.punctuation)
table = str.maketrans("", "", string.punctuation)
regex = re.compile(f"[{re.escape(string.punctuation)}]")

def test_set(s):
    return "".join(ch for ch in s if ch not in exclude)

def test_regex(s):  # From Vinko's solution, with fix.
    return regex.sub("", s)

def test_translate(s):
    return s.translate(table)

def test_replace(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s

opts = dict(globals=globals(), number=opts.iterations)
solutions = "set", "regex", "translate", "replace"
for solution in solutions:
    elapsed = timeit.timeit(f"test_{solution}(s)", **opts)
    print(f"{solution:<10}: {elapsed:6.3f}")

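For convenience, I have summed up the notes on stripping punctuation from a string in both Python 2 and Python 3. Please refer to the other answers for the detailed description.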


Python 2

import string

s = "string. With. Punctuation?"
table = string.maketrans("","")
new_s = s.translate(table, string.punctuation)      # Output: string without punctuation

Python 3

import string

s = "string. With. Punctuation?"
table = str.maketrans(dict.fromkeys(string.punctuation))  # OR {key: None for key in string.punctuation}
new_s = s.translate(table)                          # Output: string without punctuation

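As an update, I rewrote @Brian's example in Python 3 and changed it to move the regex compile step inside the function. My thought here was to time every single step needed to make the function work. Perhaps you are using distributed computing and can't share the regex object between your workers, so you need to run the re.compile step on each worker. I was also curious to time two different implementations of maketrans for Python 3: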

table = str.maketrans({key: None for key in string.punctuation})

vs

table = str.maketrans('', '', string.punctuation)
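As far as I can tell, both calls build the same mapping (punctuation code point to None), so any timing difference would come from constructing the table, not from translate itself. A quick check, as a sketch; the variable names are made up for illustration.

```python
import string

table_from_dict = str.maketrans({key: None for key in string.punctuation})
table_from_args = str.maketrans("", "", string.punctuation)

# Both tables map each punctuation code point (an int) to None.
same_table = table_from_dict == table_from_args

s = "string. With. Punctuation?"
same_output = s.translate(table_from_dict) == s.translate(table_from_args)
```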

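Plus, I added another approach that uses a set, where I take advantage of the intersection function to reduce the number of iterations.

This is the complete code: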

import re, string, timeit

s = "string. With. Punctuation"


def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)


def test_set2(s):
    _punctuation = set(string.punctuation)
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())


def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)


def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)


def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return(s.translate(table))


def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s


print("sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2      :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))

These are my results:

sets      : 3.1830138750374317
sets2      : 2.189873124472797
regex     : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace   : 4.579746678471565