Best way to strip punctuation from a string

It seems like there should be a simpler way than this:

import string
s = "string. With. Punctuation?" # Sample string 
out = s.translate(string.maketrans("",""), string.punctuation)

Is there?

Current answer

Regular expressions make this simple, if you know them.

import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)
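
One caveat worth checking: \w also matches the underscore, so the pattern above leaves _ in place. A minimal sketch, using a sample string of my own choosing, that shows this and one possible way to drop the underscore as well:

```python
import re

s = "snake_case, with: punctuation!"

# \w matches letters, digits, and the underscore, so "_" survives.
kept_underscore = re.sub(r'[^\w\s]', '', s)

# Adding "_" as an alternative removes it as well.
no_underscore = re.sub(r'[^\w\s]|_', '', s)

print(kept_underscore)  # snake_case with punctuation
print(no_underscore)    # snakecase with punctuation
```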

2013-05-28 18:47:47

Other answers

You can also do this:

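import string
text = "string. With. Punctuation?"  # sample input
# strip() only trims punctuation at the start and end of each word
' '.join(word.strip(string.punctuation) for word in text.split())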
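
Note that str.strip only trims the ends of each word, so punctuation inside a word is kept. A small sketch with a made-up sample sentence illustrates the difference:

```python
import string

text = "Hello, world! Don't split e.g. this."

# strip() removes punctuation only from the start and end of each word,
# so the apostrophe in "Don't" and the inner dot in "e.g." remain.
cleaned = ' '.join(word.strip(string.punctuation) for word in text.split())

print(cleaned)  # Hello world Don't split e.g this
```

Whether that is desirable depends on the use case: it preserves contractions, but it is not a full punctuation remover.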

2021-04-27 11:48:29

Try this one :)

import regex  # third-party module: pip install regex
regex.sub(r'\p{P}', '', s)

2020-09-02 07:51:45

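When working with Unicode strings, I recommend the PyPI regex module, because it supports both Unicode property classes (such as \p{X} / \P{X}) and POSIX character classes (such as [:name:]).

To install the package, type pip install regex (or pip3 install regex) in a terminal and press Enter.

If you need to remove punctuation and symbols of any kind (that is, anything other than letters, digits, and whitespace), you can use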

regex.sub(r'[\p{P}\p{S}]', '', text)  # to remove one by one
regex.sub(r'[\p{P}\p{S}]+', '', text) # to remove all consecutive punctuation/symbols with one go
regex.sub(r'[[:punct:]]+', '', text)  # Same with a POSIX character class

A Python demo:

import regex

text = 'भारत India <><>^$.,,! 002'
new_text = regex.sub(r'[\p{P}\p{S}\s]+', ' ', text).lower().strip()
# OR
# new_text = regex.sub(r'[[:punct:]\s]+', ' ', text).lower().strip()

print(new_text)
# => भारत india 002

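Here, I added a whitespace pattern (\s) to the character class, so every run of punctuation, symbols, and whitespace is replaced with a single space.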
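
If installing a third-party package is not an option, a rough standard-library approximation is to filter on Unicode general categories with unicodedata. This is my own sketch, not part of the regex module, and the helper name strip_punct_and_symbols is hypothetical. Categories beginning with P correspond to \p{P}, and those beginning with S to \p{S}.

```python
import unicodedata

def strip_punct_and_symbols(text):
    """Drop characters whose Unicode category is punctuation (P*) or symbol (S*)."""
    return ''.join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ('P', 'S')
    )

sample = 'भारत India <><>^$.,,! 002'
# Collapse leftover whitespace runs, then lowercase, as in the demo above.
result = ' '.join(strip_punct_and_symbols(sample).split()).lower()
print(result)  # भारत india 002
```

This walks the string one character at a time in Python, so it will likely be slower than a compiled pattern on large inputs.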

2021-12-01 14:37:52

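Apparently I can't edit the selected answer, so here is an update that works for Python 3. The translate method is still the most efficient option for non-trivial transformations.

Credit for the original heavy lifting goes to @Brian above, and thanks to @ddejohn for the suggestion that improved the original test.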

#!/usr/bin/env python3

"""Determination of most efficient way to remove punctuation in Python 3.

Results in Python 3.8.10 on my system using the default arguments:

set       : 51.897
regex     : 17.901
translate :  2.059
replace   : 13.209
"""

import argparse
import re
import string
import timeit

parser = argparse.ArgumentParser()
parser.add_argument("--filename", "-f", default=argparse.__file__)
parser.add_argument("--iterations", "-i", type=int, default=10000)
opts = parser.parse_args()
with open(opts.filename) as fp:
    s = fp.read()
exclude = set(string.punctuation)
table = str.maketrans("", "", string.punctuation)
regex = re.compile(f"[{re.escape(string.punctuation)}]")

def test_set(s):
    return "".join(ch for ch in s if ch not in exclude)

def test_regex(s):  # From Vinko's solution, with fix.
    return regex.sub("", s)

def test_translate(s):
    return s.translate(table)

def test_replace(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s

opts = dict(globals=globals(), number=opts.iterations)
solutions = "set", "regex", "translate", "replace"
for solution in solutions:
    elapsed = timeit.timeit(f"test_{solution}(s)", **opts)
    print(f"{solution:<10}: {elapsed:6.3f}")
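
Before comparing timings, it is worth confirming that the four approaches return the same result. A minimal check, using a sample string of my own rather than the file the script reads, might look like this:

```python
import re
import string

# Sample input (an assumption for illustration); it includes characters
# that are special inside a regex character class: [ ] \ ^ -
sample = "string. With. Punctuation? [brackets] back\\slash ^caret-dash"

exclude = set(string.punctuation)
table = str.maketrans("", "", string.punctuation)
pattern = re.compile(f"[{re.escape(string.punctuation)}]")

by_set = "".join(ch for ch in sample if ch not in exclude)
by_regex = pattern.sub("", sample)
by_translate = sample.translate(table)

by_replace = sample
for c in string.punctuation:
    by_replace = by_replace.replace(c, "")

# All four strategies should agree on ASCII punctuation.
assert by_set == by_regex == by_translate == by_replace
print(by_translate)  # string With Punctuation brackets backslash caretdash
```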

2021-10-05 13:28:02

Not necessarily simpler, but a different approach if you are more familiar with the re family.

import re, string
s = "string. With. Punctuation?" # Sample string 
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)

2008-11-05 17:39:55
