Removing all non-numeric characters from a string in Python

How do I remove all non-numeric characters from a string in Python?

Current answer

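There are a lot of correct answers here. Some are faster or slower than others. The approach used in the answers by Ehsan Akbaritabar and tot, filtering with str.isdigit, is very fast; so is translate, from Alex Martelli's answer, once the setup is done. These are the two fastest approaches. However, if you are only doing a one-off substitution, the setup cost of translate is significant.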

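Which approach is best depends on your use case. A one-off substitution in a unit test? I would filter with isdigit. It needs no imports, uses only builtins, and is quick and simple: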

''.join(filter(str.isdigit, string_to_filter))
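One caveat worth knowing: str.isdigit is Unicode-aware, so it also accepts characters such as superscript digits and digits from other scripts. If only ASCII 0-9 should survive, a minimal sketch could look like the following. The helper name keep_ascii_digits and the sample string are made up for illustration.

```python
ASCII_DIGITS = frozenset("0123456789")

def keep_ascii_digits(text: str) -> str:
    """Return only the ASCII digits 0-9 from text (hypothetical helper)."""
    return "".join(c for c in text if c in ASCII_DIGITS)

# "\u00b2" is SUPERSCRIPT TWO, "\u0663" is ARABIC-INDIC DIGIT THREE.
sample = "tel 555\u00b2\u0663"

unicode_aware = "".join(filter(str.isdigit, sample))  # keeps both non-ASCII digits
ascii_only = keep_ascii_digits(sample)                # keeps only "555"
print(repr(unicode_aware), repr(ascii_only))
```

For plain ASCII input the two approaches give the same result, so this only matters when the input may contain other scripts.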

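In a pandas or pyspark DataFrame with millions of rows, the efficiency of translate may be worth it if you are not using the methods the DataFrame provides (which tend to rely on regex).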

If you want to use the translate approach, I'd recommend some changes for Python 3:

import string

unicode_non_digits = dict.fromkeys(
    [x for x in range(65536) if chr(x) not in string.digits]
)
string_to_filter.translate(unicode_non_digits)
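Note that a table built from range(65536) only covers the Basic Multilingual Plane; str.translate leaves any code point that is missing from the table unchanged, so characters above U+FFFF would pass through. One possible alternative, which is not the method benchmarked in this answer and has not been timed here, is a dict subclass whose __missing__ returns None so that every unlisted code point is deleted. The class name DigitsOnly is hypothetical.

```python
import string

class DigitsOnly(dict):
    """Translation table that keeps ASCII digits and deletes everything else."""

    def __init__(self):
        # Map each digit's code point to itself so that it is kept.
        super().__init__({ord(d): d for d in string.digits})

    def __missing__(self, key):
        # str.translate treats a None result as "delete this character".
        return None

digits_only = DigitsOnly()

# "\U0001F600" lies outside range(65536) and is still removed.
cleaned = "abc123def45\U0001F600".translate(digits_only)
print(cleaned)  # 12345
```

The table holds only ten entries, so building it is cheap, though each lookup of a non-digit goes through a Python-level __missing__ call.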

Method	Loops	Repeats	Best of result per loop
`filter using isdigit`	1000	15	0.83 usec
`generator using isdigit`	1000	15	1.6 usec
`using re.sub`	1000	15	1.94 usec
`generator testing membership in digits`	1000	15	1.23 usec
`generator testing membership in digits set`	1000	15	1.19 usec
`use translate`	1000	15	0.797 usec
`use re.compile`	1000	15	1.52 usec
`use translate but make translation table every time`	20	5	1.21e+04 usec

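The last row in the table shows the setup penalty for translate. When building the translation table every time, I used the default number and repeat options; otherwise it takes too long.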

Raw output from my timing script:

/bin/zsh /Users/henry.longmore/Library/Application\ Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:6> which python
/Users/henry.longmore/.pyenv/shims/python
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:7> python --version
Python 3.10.6
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:8> set +x
-----filter using isdigit
1000 loops, best of 15: 0.83 usec per loop
-----generator using isdigit
1000 loops, best of 15: 1.6 usec per loop
-----using re.sub
1000 loops, best of 15: 1.94 usec per loop
-----generator testing membership in digits
1000 loops, best of 15: 1.23 usec per loop
-----generator testing membership in digits set
1000 loops, best of 15: 1.19 usec per loop
-----use translate
1000 loops, best of 15: 0.797 usec per loop
-----use re.compile
1000 loops, best of 15: 1.52 usec per loop
-----use translate but make translation table every time
     using default number and repeat, otherwise this takes too long
20 loops, best of 5: 1.21e+04 usec per loop

The script I used for timing:

NUMBER=1000
REPEAT=15
UNIT="usec"
TEST_STRING="abc123def45ghi6789"
set -x
which python
python --version
set +x
echo "-----filter using isdigit"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT "''.join(filter(str.isdigit, '${TEST_STRING}'))"
echo "-----generator using isdigit"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT "''.join(c for c in '${TEST_STRING}' if c.isdigit())"
echo "-----using re.sub"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import re" "re.sub('[^0-9]', '', '${TEST_STRING}')"
echo "-----generator testing membership in digits"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="from string import digits" "''.join(c for c in '${TEST_STRING}' if c in digits)"
echo "-----generator testing membership in digits set"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="from string import digits; digits = {*digits}" "''.join(c for c in '${TEST_STRING}' if c in digits)"
echo "-----use translate"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import string; unicode_non_digits = dict.fromkeys([x for x in range(65536) if chr(x) not in string.digits])" "'${TEST_STRING}'.translate(unicode_non_digits)"
echo "-----use re.compile"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import re; digit_filter = re.compile('[^0-9]')" "digit_filter.sub('', '${TEST_STRING}')"
echo "-----use translate but make translation table every time"
echo "     using default number and repeat, otherwise this takes too long"
python -m timeit --unit=$UNIT --setup="import string" "unicode_non_digits = dict.fromkeys([x for x in range(65536) if chr(x) not in string.digits]); '${TEST_STRING}'.translate(unicode_non_digits)"
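For readers who prefer to stay inside Python, a rough equivalent of part of the shell script can be sketched with timeit.repeat. This is an assumption-laden sketch rather than the script that produced the numbers above: it covers only three of the methods, and wrapping each statement in a lambda adds call overhead, so the absolute timings will not match the table.

```python
import re
import timeit

TEST_STRING = "abc123def45ghi6789"
NUMBER = 1000
REPEAT = 15

digit_filter = re.compile("[^0-9]")

# Each candidate is a zero-argument callable that timeit can invoke.
candidates = {
    "filter using isdigit": lambda: "".join(filter(str.isdigit, TEST_STRING)),
    "generator using isdigit": lambda: "".join(c for c in TEST_STRING if c.isdigit()),
    "use re.compile": lambda: digit_filter.sub("", TEST_STRING),
}

for name, func in candidates.items():
    # repeat() returns the total seconds for each batch of NUMBER calls.
    best = min(timeit.repeat(func, number=NUMBER, repeat=REPEAT))
    print(f"-----{name}")
    print(f"{NUMBER} loops, best of {REPEAT}: {best / NUMBER * 1e6:.3g} usec per loop")
```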

2022-12-01 22:54:01

Other answers

There are lots of correct answers, but in case you want a float directly, without using regex:

x = '$123.45M'

float(''.join(c for c in x if (c.isdigit() or c == '.')))

123.45

You can change the dot to a comma depending on your needs.

If you know your number is an integer, use this instead:

x = '$1123'
int(''.join(c for c in x if c.isdigit()))

1123
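On the comma variant mentioned above: float() itself only accepts a dot as the decimal separator, so the kept comma has to be converted before the call. A small sketch under that assumption follows; the function name parse_number and its decimal_sep parameter are invented for illustration, and it does not handle signs, thousands separators, or more than one separator.

```python
def parse_number(text: str, decimal_sep: str = ".") -> float:
    """Keep digits and one kind of decimal separator, then convert to float."""
    kept = "".join(c for c in text if c.isdigit() or c == decimal_sep)
    # float() only understands ".", so normalise the separator first.
    return float(kept.replace(decimal_sep, "."))

print(parse_number("$123.45M"))         # 123.45
print(parse_number("123,45 EUR", ","))  # 123.45
```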

2020-02-12 23:46:22

>>> import re
>>> re.sub("[^0-9]", "", "sdkjh987978asd098as0980a98sd")
'987978098098098'
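If the same substitution runs many times, the pattern can be compiled once and reused, as the timing table in the current answer suggests. A minimal sketch, where the names NON_DIGITS and strip_non_digits are illustrative only. Note that [^0-9] is ASCII-only, whereas \D on a str pattern follows Unicode digit rules unless re.ASCII is passed.

```python
import re

# Compiled once at import time and reused for every call.
NON_DIGITS = re.compile(r"[^0-9]")

def strip_non_digits(text: str) -> str:
    """Delete every character that is not an ASCII digit."""
    return NON_DIGITS.sub("", text)

print(strip_non_digits("sdkjh987978asd098as0980a98sd"))  # 987978098098098
```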

2009-08-08 17:25:21

To add another option to the mix, there are several useful constants in the string module. While they are more useful in other cases, they can be used here.

>>> from string import digits
>>> ''.join(c for c in "abc123def456" if c in digits)
'123456'

There are several constants in the module, including:

ascii_letters (abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ)
hexdigits (0123456789abcdefABCDEF)

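If you use these constants heavily, it can be worth converting them to a frozenset. That gives O(1) membership tests instead of O(n), where n is the length of the original string constant.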

>>> digits = frozenset(digits)
>>> ''.join(c for c in "abc123def456" if c in digits)
'123456'

2012-09-07 10:37:03

Not sure if this is the most efficient way, but:

>>> ''.join(c for c in "abc123def456" if c.isdigit())
'123456'

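The ''.join part means to combine all the resulting characters together without any characters in between. The rest of it is a generator expression, where (as you can probably guess) we only take the parts of the string that match the condition isdigit.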

2009-08-08 17:16:55
