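Removing all non-numeric characters from a string in Python

How do I remove all non-numeric characters from a string in Python?

Current answer

The fastest approach, if you need to perform more than one or two such removals (or even just one, but on a very long string), is to rely on the translate method of strings, even though it needs some preparation (the session below uses Python 2):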

>>> import string
>>> allchars = ''.join(chr(i) for i in xrange(256))
>>> identity = string.maketrans('', '')
>>> nondigits = allchars.translate(identity, string.digits)
>>> s = 'abc123def456'
>>> s.translate(identity, nondigits)
'123456'
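
In Python 3, `string.maketrans` no longer exists and `xrange` has become `range`, so the session above does not run unchanged. A minimal sketch of the same idea for byte strings, assuming Python 3; the names `DIGIT_BYTES`, `NON_DIGITS` and `digits_only_bytes` are illustrative and not from the answer:

```python
import string

# The ten ASCII digits as bytes.
DIGIT_BYTES = string.digits.encode("ascii")

# All 256 byte values except the ASCII digits; built once and reused.
NON_DIGITS = bytes(b for b in range(256) if b not in DIGIT_BYTES)


def digits_only_bytes(data: bytes) -> bytes:
    """Return data with every byte that is not an ASCII digit removed."""
    # Passing None as the table leaves the surviving bytes unchanged;
    # the second argument lists the bytes to delete.
    return data.translate(None, NON_DIGITS)


print(digits_only_bytes(b"abc123def456"))  # b'123456'
```

As in the original, the deletion set is the part worth building only once.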

Incidentally, the translate method works differently on Unicode strings than on byte strings, and is perhaps a bit simpler to use:

>>> unondig = dict.fromkeys(xrange(65536))
>>> for x in string.digits: del unondig[ord(x)]
... 
>>> s = u'abc123def456'
>>> s.translate(unondig)
u'123456'

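You might want to use a mapping class rather than an actual dict, especially if your Unicode string may contain characters with very high ord values (which would make the dict excessively large). For example: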

>>> class keeponly(object):
...   def __init__(self, keep): 
...     self.keep = set(ord(c) for c in keep)
...   def __getitem__(self, key):
...     if key in self.keep:
...       return key
...     return None
... 
>>> s.translate(keeponly(string.digits))
u'123456'
>>>
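
The mapping-class idea carries over to Python 3, where every `str` is Unicode and code points can go well beyond 65535. A hedged sketch, assuming Python 3; the class name `KeepOnly` and the sample strings are illustrative:

```python
import string


class KeepOnly:
    """Mapping for str.translate that deletes every character not in keep."""

    def __init__(self, keep):
        self.keep = {ord(c) for c in keep}

    def __getitem__(self, codepoint):
        # str.translate looks up each character's code point (an int).
        # Returning the code point keeps the character; None deletes it.
        return codepoint if codepoint in self.keep else None


keep_digits = KeepOnly(string.digits)
print("abc123def456".translate(keep_digits))   # 123456
print("a1\U0001F6002".translate(keep_digits))  # 12
```

Because nothing is precomputed per code point, the character U+1F600 in the second call is removed without needing a table that covers the whole Unicode range.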

2009-08-08 17:35:59

Other answers

Not sure if this is the most efficient way, but:

>>> ''.join(c for c in "abc123def456" if c.isdigit())
'123456'

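The ''.join part means to combine all the resulting characters together without any characters in between. The rest is a generator expression, where (as you can probably guess) we take only the characters of the string that satisfy the isdigit condition.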
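
One detail worth knowing about this approach in Python 3: `str.isdigit()` is also true for some non-ASCII characters, such as superscript digits and digits from other scripts. The sketch below contrasts it with an explicit ASCII check; the function names and the sample string are illustrative assumptions, not part of the answer:

```python
def digits_only(text):
    """Keep every character for which str.isdigit() is true."""
    return "".join(c for c in text if c.isdigit())


def ascii_digits_only(text):
    """Keep only the ten ASCII digits 0-9."""
    return "".join(c for c in text if c in "0123456789")


# Contains SUPERSCRIPT TWO (U+00B2) and ARABIC-INDIC DIGIT THREE (U+0663).
sample = "x\u00b2 = 4\u0663"

print(digits_only(sample))        # keeps all three digit-like characters
print(ascii_digits_only(sample))  # 4
```

If the result is later passed to `int()`, the ASCII-only variant is the safer choice, since `int()` rejects characters such as the superscript two.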

2009-08-08 17:16:55

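@Ned Batchelder and @newacct gave the right answer, but...

Just in case your string contains commas (,) and a decimal point (.), and you want to keep the decimal point: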

>>> import re
>>> re.sub(r"[^\d.]", "", "$1,999,888.77")
'1999888.77'
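
If the cleaned string is meant to be used as a number, it can be handed to `decimal.Decimal`. A small sketch, assuming the input has at most one decimal point and uses `,` only as a thousands separator; the helper name `parse_amount` is hypothetical:

```python
import re
from decimal import Decimal

# Matches anything that is neither a digit nor a dot; compiled once.
_NOT_DIGIT_OR_DOT = re.compile(r"[^\d.]")


def parse_amount(text):
    """Strip everything except digits and dots, then convert to Decimal."""
    cleaned = _NOT_DIGIT_OR_DOT.sub("", text)
    return Decimal(cleaned)


print(parse_amount("$1,999,888.77"))  # 1999888.77
```

Input with more than one dot, such as "1.2.3", survives the cleaning step but makes `Decimal` raise an error, so the assumption above matters.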

2018-11-09 15:49:18

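There are many correct answers here. Some are faster or slower than others. The method used in the answers by Ehsan Akbaritabar and tot, filtering with str.isdigit, is very fast; so is translate, from Alex Martelli's answer, once the setup is done. These are the two fastest methods. However, if you only do the substitution once, the setup cost of translate is significant.

Which way is best depends on your use case. A single replacement in a unit test? I would go with filtering using isdigit. It requires no imports, uses only builtins, and is quick and simple: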

''.join(filter(str.isdigit, string_to_filter))

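In a pandas or pyspark DataFrame with millions of rows, the efficiency of translate is probably worth it, if you do not use the methods the DataFrame provides (which tend to rely on regex).

If you want to use the translate approach, I would recommend some changes for Python 3: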

import string

unicode_non_digits = dict.fromkeys(
    [x for x in range(65536) if chr(x) not in string.digits]
)
string_to_filter.translate(unicode_non_digits)

Method	Loops	Repeats	Best time per loop
filter using isdigit	1000	15	0.83 usec
generator using isdigit	1000	15	1.6 usec
using re.sub	1000	15	1.94 usec
generator testing membership in digits	1000	15	1.23 usec
generator testing membership in digits set	1000	15	1.19 usec
use translate	1000	15	0.797 usec
use re.compile	1000	15	1.52 usec
use translate but make translation table every time	20	5	1.21e+04 usec

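The last row of the table shows the setup penalty of translate. For the case that builds the translation table every time, I used timeit's default number and repeat options, because otherwise it takes too long.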
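
The setup penalty can also be observed from inside Python with the `timeit` module. This is a rough sketch rather than the author's script: the helper `build_table` and the loop count of 20 are assumptions, and the absolute numbers will differ between machines.

```python
import string
import timeit

TEST_STRING = "abc123def45ghi6789"
LOOPS = 20


def build_table():
    """Map every non-digit code point below 65536 to None (deleted by translate)."""
    return dict.fromkeys(
        x for x in range(65536) if chr(x) not in string.digits
    )


table = build_table()

# Table built once, outside the timed statement.
reuse = timeit.timeit(lambda: TEST_STRING.translate(table), number=LOOPS)

# Table rebuilt on every call, inside the timed statement.
rebuild = timeit.timeit(lambda: TEST_STRING.translate(build_table()), number=LOOPS)

print(f"reuse table:   {reuse / LOOPS * 1e6:.3f} usec per loop")
print(f"rebuild table: {rebuild / LOOPS * 1e6:.3f} usec per loop")
```

Keeping the table in a module-level variable, as `table` is here, is what lets repeated calls avoid the rebuild cost.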

Raw output from my timing script:

/bin/zsh /Users/henry.longmore/Library/Application\ Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:6> which python
/Users/henry.longmore/.pyenv/shims/python
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:7> python --version
Python 3.10.6
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:8> set +x
-----filter using isdigit
1000 loops, best of 15: 0.83 usec per loop
-----generator using isdigit
1000 loops, best of 15: 1.6 usec per loop
-----using re.sub
1000 loops, best of 15: 1.94 usec per loop
-----generator testing membership in digits
1000 loops, best of 15: 1.23 usec per loop
-----generator testing membership in digits set
1000 loops, best of 15: 1.19 usec per loop
-----use translate
1000 loops, best of 15: 0.797 usec per loop
-----use re.compile
1000 loops, best of 15: 1.52 usec per loop
-----use translate but make translation table every time
     using default number and repeat, otherwise this takes too long
20 loops, best of 5: 1.21e+04 usec per loop

The script I used for the timings:

NUMBER=1000
REPEAT=15
UNIT="usec"
TEST_STRING="abc123def45ghi6789"
set -x
which python
python --version
set +x
echo "-----filter using isdigit"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT "''.join(filter(str.isdigit, '${TEST_STRING}'))"
echo "-----generator using isdigit"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT "''.join(c for c in '${TEST_STRING}' if c.isdigit())"
echo "-----using re.sub"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import re" "re.sub('[^0-9]', '', '${TEST_STRING}')"
echo "-----generator testing membership in digits"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="from string import digits" "''.join(c for c in '${TEST_STRING}' if c in digits)"
echo "-----generator testing membership in digits set"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="from string import digits; digits = {*digits}" "''.join(c for c in '${TEST_STRING}' if c in digits)"
echo "-----use translate"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import string; unicode_non_digits = dict.fromkeys([x for x in range(65536) if chr(x) not in string.digits])" "'${TEST_STRING}'.translate(unicode_non_digits)"
echo "-----use re.compile"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import re; digit_filter = re.compile('[^0-9]')" "digit_filter.sub('', '${TEST_STRING}')"
echo "-----use translate but make translation table every time"
echo "     using default number and repeat, otherwise this takes too long"
python -m timeit --unit=$UNIT --setup="import string" "unicode_non_digits = dict.fromkeys([x for x in range(65536) if chr(x) not in string.digits]); '${TEST_STRING}'.translate(unicode_non_digits)"

2022-12-01 22:54:01

To add another option to the mix, the string module has several useful constants. While they are more useful in other cases, they can be used here.

>>> from string import digits
>>> ''.join(c for c in "abc123def456" if c in digits)
'123456'

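The module has several constants, including:

ascii_letters (abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ)
hexdigits (0123456789abcdefABCDEF)

If you use these constants heavily, it can be worthwhile to convert them to a frozenset. That gives O(1) lookups rather than O(n), where n is the length of the original string constant.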

>>> digits = frozenset(digits)
>>> ''.join(c for c in "abc123def456" if c in digits)
'123456'

2012-09-07 10:37:03

>>> import re
>>> re.sub("[^0-9]", "", "sdkjh987978asd098as0980a98sd")
'987978098098098'

2009-08-08 17:25:21
