假设这个字符串:
The fox jumped over the log.
变成:
The fox jumped over the log.
在不分割和进入列表的情况下,最简单的实现方法(1-2行)是什么?
假设这个字符串:
The fox jumped over the log.
变成:
The fox jumped over the log.
在不分割和进入列表的情况下,最简单的实现方法(1-2行)是什么?
当前回答
这个正是你想要的
old_string = 'The fox jumped over the log '
new_string = " ".join(old_string.split())
print(new_string)
将会导致
The fox jumped over the log.
其他回答
类似于前面的解决方案,但更具体:用一个空格替换两个或多个空格:
>>> import re
>>> s = "The fox jumped over the log."
>>> re.sub('\s{2,}', ' ', s)
'The fox jumped over the log.'
一个简单的灵魂
>>> import re
>>> s="The fox jumped over the log."
>>> print re.sub('\s+',' ', s)
The fox jumped over the log.
一行代码,删除句子之前、之后和内部所有多余的空格:
sentence = " The fox jumped over the log. "
sentence = ' '.join(filter(None,sentence.split(' ')))
解释:
将整个字符串拆分为一个列表。 从列表中过滤空元素。 用一个空格重新连接剩下的元素*
*其余的元素应该是单词或带有标点符号的单词等。我没有对此进行广泛测试,但这应该是一个很好的起点。祝你一切顺利!
使用带有“\s”的正则表达式并执行简单的string.split()也将删除其他空白-如换行符、回车符、制表符。除非需要使用多个空格,否则我将给出这些示例。
我使用了11段,1000个单词,6665字节的Lorem Ipsum来进行真实的时间测试,并在整个过程中使用了随机长度的额外空格:
original_string = ''.join(word + (' ' * random.randint(1, 10)) for word in lorem_ipsum.split(' '))
一行程序实际上是删除任何前导/尾随空格,并保留一个前导/尾随空格(但只有ONE;-)。
# setup = '''
import re
def while_replace(string):
while ' ' in string:
string = string.replace(' ', ' ')
return string
def re_replace(string):
return re.sub(r' {2,}' , ' ', string)
def proper_join(string):
split_string = string.split(' ')
# To account for leading/trailing spaces that would simply be removed
beg = ' ' if not split_string[ 0] else ''
end = ' ' if not split_string[-1] else ''
# versus simply ' '.join(item for item in string.split(' ') if item)
return beg + ' '.join(item for item in split_string if item) + end
original_string = """Lorem ipsum ... no, really, it kept going... malesuada enim feugiat. Integer imperdiet erat."""
assert while_replace(original_string) == re_replace(original_string) == proper_join(original_string)
#'''
# while_replace_test
new_string = original_string[:]
new_string = while_replace(new_string)
assert new_string != original_string
# re_replace_test
new_string = original_string[:]
new_string = re_replace(new_string)
assert new_string != original_string
# proper_join_test
new_string = original_string[:]
new_string = proper_join(new_string)
assert new_string != original_string
NOTE: The "while version" made a copy of the original_string, as I believe once modified on the first run, successive runs would be faster (if only by a bit). As this adds time, I added this string copy to the other two so that the times showed the difference only in the logic. Keep in mind that the main stmt on timeit instances will only be executed once; the original way I did this, the while loop worked on the same label, original_string, thus the second run, there would be nothing to do. The way it's set up now, calling a function, using two different labels, that isn't a problem. I've added assert statements to all the workers to verify we change something every iteration (for those who may be dubious). E.g., change to this and it breaks:
# while_replace_test
new_string = original_string[:]
new_string = while_replace(new_string)
assert new_string != original_string # will break the 2nd iteration
while ' ' in original_string:
original_string = original_string.replace(' ', ' ')
Tests run on a laptop with an i5 processor running Windows 7 (64-bit).
timeit.Timer(stmt = test, setup = setup).repeat(7, 1000)
test_string = 'The fox jumped over\n\t the log.' # trivial
Python 2.7.3, 32-bit, Windows
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.001066 | 0.001260 | 0.001128 | 0.001092
re_replace_test | 0.003074 | 0.003941 | 0.003357 | 0.003349
proper_join_test | 0.002783 | 0.004829 | 0.003554 | 0.003035
Python 2.7.3, 64-bit, Windows
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.001025 | 0.001079 | 0.001052 | 0.001051
re_replace_test | 0.003213 | 0.004512 | 0.003656 | 0.003504
proper_join_test | 0.002760 | 0.006361 | 0.004626 | 0.004600
Python 3.2.3, 32-bit, Windows
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.001350 | 0.002302 | 0.001639 | 0.001357
re_replace_test | 0.006797 | 0.008107 | 0.007319 | 0.007440
proper_join_test | 0.002863 | 0.003356 | 0.003026 | 0.002975
Python 3.3.3, 64-bit, Windows
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.001444 | 0.001490 | 0.001460 | 0.001459
re_replace_test | 0.011771 | 0.012598 | 0.012082 | 0.011910
proper_join_test | 0.003741 | 0.005933 | 0.004341 | 0.004009
test_string = lorem_ipsum
# Thanks to http://www.lipsum.com/
# "Generated 11 paragraphs, 1000 words, 6665 bytes of Lorem Ipsum"
Python 2.7.3, 32-bit
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.342602 | 0.387803 | 0.359319 | 0.356284
re_replace_test | 0.337571 | 0.359821 | 0.348876 | 0.348006
proper_join_test | 0.381654 | 0.395349 | 0.388304 | 0.388193
Python 2.7.3, 64-bit
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.227471 | 0.268340 | 0.240884 | 0.236776
re_replace_test | 0.301516 | 0.325730 | 0.308626 | 0.307852
proper_join_test | 0.358766 | 0.383736 | 0.370958 | 0.371866
Python 3.2.3, 32-bit
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.438480 | 0.463380 | 0.447953 | 0.446646
re_replace_test | 0.463729 | 0.490947 | 0.472496 | 0.468778
proper_join_test | 0.397022 | 0.427817 | 0.406612 | 0.402053
Python 3.3.3, 64-bit
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.284495 | 0.294025 | 0.288735 | 0.289153
re_replace_test | 0.501351 | 0.525673 | 0.511347 | 0.508467
proper_join_test | 0.422011 | 0.448736 | 0.436196 | 0.440318
对于普通字符串,while循环似乎是最快的,其次是Pythonic字符串分割/连接,regex在后面拉。
对于非平凡的字符串,似乎有更多的考虑。32位2.7 ?它是正则表达式的救星!2.7 64位?while循环是最好的。32位3.2,使用“适当的”连接。64位3.3,执行while循环。一次。
最后,如果/何地/何时需要,就可以提高性能,但最好记住这句咒语:
让它起作用 纠正错误 快一点
IANAL, ymv
另一个选择:
>>> import re
>>> str = 'this is a string with multiple spaces and tabs'
>>> str = re.sub('[ \t]+' , ' ', str)
>>> print str
this is a string with multiple spaces and tabs