似乎应该有一种比以下更简单的方法:

import string
s = "string. With. Punctuation?" # Sample string 
out = s.translate(string.maketrans("",""), string.punctuation)

有?


当前回答

字符串标点符号仅为ASCII!一种更正确(但也慢得多)的方法是使用unicodedata模块:

# -*- coding: utf-8 -*-
from unicodedata import category
s = u'String — with -  «punctation »...'
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print 'stripped', s

您还可以概括和剥离其他类型的字符:

''.join(ch for ch in s if category(ch)[0] not in 'SP')

它还将删除~*+§$等字符,这些字符可能是“标点符号”,也可能不是“标点符号。

其他回答

不一定更简单,但如果你更熟悉re家族的话,就另辟蹊径。

import re, string
s = "string. With. Punctuation?" # Sample string 
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)

我通常用这样的词:

>>> s = "string. With. Punctuation?" # Sample string
>>> import string
>>> for c in string.punctuation:
...     s= s.replace(c,"")
...
>>> s
'string With Punctuation'

在处理Unicode字符串时,我建议使用PyPi正则表达式模块,因为它同时支持Unicode属性类(如\p{X}/\p{X})和POSIX字符类(如[:name:])。

只需在终端中键入pipinstallregex(或pip3installregex)并按回车键即可安装软件包。

如果您需要删除任何类型的标点符号(即除字母、数字和空格之外的任何其他符号),您可以使用

regex.sub(r'[\p{P}\p{S}]', '', text)  # to remove one by one
regex.sub(r'[\p{P}\p{S}]+', '', text) # to remove all consecutive punctuation/symbols with one go
regex.sub(r'[[:punct:]]+', '', text)  # Same with a POSIX character class

在线观看Python演示:

import regex

text = 'भारत India <><>^$.,,! 002'
new_text = regex.sub(r'[\p{P}\p{S}\s]+', ' ', text).lower().strip()
# OR
# new_text = regex.sub(r'[[:punct:]\s]+', ' ', text).lower().strip()

print(new_text)
# => भारत india 002

在这里,我向字符类添加了空白模式

我喜欢使用这样的函数:

def scrub(abc):
    while abc[-1] is in list(string.punctuation):
        abc=abc[:-1]
    while abc[0] is in list(string.punctuation):
        abc=abc[1:]
    return abc

使用Python从文本文件中删除停止词

print('====THIS IS HOW TO REMOVE STOP WORS====')

with open('one.txt','r')as myFile:

    str1=myFile.read()

    stop_words ="not", "is", "it", "By","between","This","By","A","when","And","up","Then","was","by","It","If","can","an","he","This","or","And","a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though","be","But","these"

    myList=[]

    myList.extend(str1.split(" "))

    for i in myList:

        if i not in stop_words:

            print ("____________")

            print(i,end='\n')