Best way to strip punctuation from a string

It seems like there should be a simpler way than this:

import string
s = "string. With. Punctuation?"  # Sample string
# Python 2 only: string.maketrans and the two-argument str.translate() do not exist in Python 3
out = s.translate(string.maketrans("",""), string.punctuation)

Is there?
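
For readers on Python 3, here is a minimal sketch of the same translate-based idea. It assumes Python 3, where `str.maketrans` accepts a third argument listing the characters to delete. The `table` variable is introduced only for illustration.

```python
import string

s = "string. With. Punctuation?"  # Sample string

# Assumption: Python 3. The third argument to str.maketrans lists
# the characters that translate() should delete.
table = str.maketrans("", "", string.punctuation)
out = s.translate(table)
print(out)  # string With Punctuation
```

The table can be built once and reused across many strings.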

Current answer

import re
s = "string. With. Punctuation?" # Sample string 
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)
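
One thing to watch for with this pattern: it keeps only ASCII letters, digits, and whitespace, so it also drops accented letters and the underscore. A small sketch of that behaviour follows. The helper name `strip_non_ascii_alnum` is made up for this example.

```python
import re

def strip_non_ascii_alnum(text):
    # Same pattern as above: anything that is not an ASCII letter,
    # an ASCII digit, or whitespace is removed.
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

print(strip_non_ascii_alnum("string. With. Punctuation?"))  # string With Punctuation
print(strip_non_ascii_alnum("naïve café_1!"))               # nave caf1
```

If the input may contain non-English text, a Unicode-aware approach is probably the safer choice.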

2017-02-02 21:48:39

Other answers

I like to use a function like this:

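import string

def scrub(abc):
    # Strip punctuation from the end, then from the start.
    # Punctuation inside the string is left alone.
    # The "abc and" guard avoids an IndexError once the string is empty.
    while abc and abc[-1] in string.punctuation:
        abc = abc[:-1]
    while abc and abc[0] in string.punctuation:
        abc = abc[1:]
    return abc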
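
Note that this only trims punctuation at the two ends of the string. As a sketch, the same trimming can be done with the built-in `str.strip`, which accepts a set of characters to remove. The name `scrub_ends` is hypothetical.

```python
import string

def scrub_ends(text):
    # str.strip(chars) removes leading and trailing characters that
    # appear in chars; characters in the middle are untouched.
    return text.strip(string.punctuation)

print(scrub_ends("...hello, world!!"))  # hello, world
print(repr(scrub_ends("?!")))           # ''
```

The comma inside "hello, world" survives, so this is not a substitute for the answers that remove punctuation everywhere.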

2013-04-06 17:28:57

I haven't seen this answer yet. Just use a regex; it removes every character except word characters (\w), digit characters (\d), and whitespace characters (\s):

import re
s = "string. With. Punctuation?" # Sample string 
out = re.sub(r'[^\w\d\s]+', '', s)  # the ur'' prefix is a syntax error in Python 3; plain r'' works
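
A short sketch of what this pattern keeps, assuming Python 3 and `str` input: \w is Unicode-aware there, and it already covers digits and the underscore, so \d is redundant and underscores are not removed. The helper name `keep_word_chars` is made up for illustration.

```python
import re

def keep_word_chars(text):
    # \w matches letters, digits and the underscore (Unicode-aware for
    # str patterns in Python 3), so \d does not need to be listed.
    return re.sub(r'[^\w\s]+', '', text)

print(keep_word_chars("string. With. Punctuation?"))  # string With Punctuation
print(keep_word_chars("héllo_wörld, ok!"))            # héllo_wörld ok
```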

2016-06-18 06:38:57

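When working with Unicode strings, I recommend the PyPI regex module, because it supports both Unicode property classes (such as \p{X} / \P{X}) and POSIX character classes (such as [:name:]).

To install the package, type pip install regex (or pip3 install regex) in a terminal and press Enter.

If you need to remove any kind of punctuation or symbol (that is, anything other than letters, digits, and whitespace), you can use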

regex.sub(r'[\p{P}\p{S}]', '', text)  # to remove one by one
regex.sub(r'[\p{P}\p{S}]+', '', text) # to remove all consecutive punctuation/symbols with one go
regex.sub(r'[[:punct:]]+', '', text)  # Same with a POSIX character class

See the Python demo:

import regex

text = 'भारत India <><>^$.,,! 002'
new_text = regex.sub(r'[\p{P}\p{S}\s]+', ' ', text).lower().strip()
# OR
# new_text = regex.sub(r'[[:punct:]\s]+', ' ', text).lower().strip()

print(new_text)
# => भारत india 002

Here, I added a whitespace pattern to the character class.

2021-12-01 14:37:52

Consider Unicode. The code has been checked in Python 3.

from unicodedata import category
text = 'hi, how are you?'
text_without_punc = ''.join(ch for ch in text if not category(ch).startswith('P'))
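
Be aware that the 'P' categories cover punctuation only. Characters such as $, < and ^ fall under the symbol categories ('S...'), so they survive the filter above. The sketch below generalizes the idea with a hypothetical helper, `strip_categories`, whose `prefixes` parameter is an assumption of this example.

```python
from unicodedata import category

def strip_categories(text, prefixes=("P",)):
    # Drop every character whose Unicode general category starts with
    # one of the given prefixes; str.startswith accepts a tuple.
    return "".join(ch for ch in text if not category(ch).startswith(prefixes))

sample = "price: $5 < $10!"
print(strip_categories(sample))              # price $5 < $10
print(strip_categories(sample, ("P", "S")))  # price 5  10
```

Removing a symbol that sat between two spaces leaves both spaces behind, so a whitespace cleanup step may still be needed afterwards.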

2020-06-04 05:08:05

Not necessarily simpler, but a different way, if you are more familiar with the re family.

import re, string
s = "string. With. Punctuation?" # Sample string 
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)
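
If the same cleanup runs over many strings, one possible variation is to compile the pattern once and reuse it. This is only a sketch; `PUNCT_RE` and `remove_punctuation` are names invented for this example.

```python
import re
import string

# re.escape makes characters such as ], \, ^ and - safe to use
# inside the character class.
PUNCT_RE = re.compile('[%s]' % re.escape(string.punctuation))

def remove_punctuation(text):
    # Delete every ASCII punctuation character listed in string.punctuation.
    return PUNCT_RE.sub('', text)

print(remove_punctuation("string. With. Punctuation?"))  # string With Punctuation
```

Like the original, this only covers the ASCII punctuation in string.punctuation.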

2008-11-05 17:39:55
