假设这个字符串:

The   fox jumped   over    the log.

变成:

The fox jumped over the log.

在不分割和进入列表的情况下,最简单的实现方法(1-2行)是什么?


当前回答

这个正是你想要的

old_string = 'The   fox jumped   over    the log '
new_string = " ".join(old_string.split())
print(new_string)

将会导致

The fox jumped over the log.

其他回答

类似于前面的解决方案,但更具体:用一个空格替换两个或多个空格:

>>> import re
>>> s = "The   fox jumped   over    the log."
>>> re.sub('\s{2,}', ' ', s)
'The fox jumped over the log.'

一个简单的灵魂

>>> import re
>>> s="The   fox jumped   over    the log."
>>> print re.sub('\s+',' ', s)
The fox jumped over the log.

一行代码,删除句子之前、之后和内部所有多余的空格:

sentence = "  The   fox jumped   over    the log.  "
sentence = ' '.join(filter(None,sentence.split(' ')))

解释:

将整个字符串拆分为一个列表。 从列表中过滤空元素。 用一个空格重新连接剩下的元素*

*其余的元素应该是单词或带有标点符号的单词等。我没有对此进行广泛测试,但这应该是一个很好的起点。祝你一切顺利!

使用带有“\s”的正则表达式并执行简单的string.split()也将删除其他空白-如换行符、回车符、制表符。除非需要使用多个空格,否则我将给出这些示例。

我使用了11段,1000个单词,6665字节的Lorem Ipsum来进行真实的时间测试,并在整个过程中使用了随机长度的额外空格:

original_string = ''.join(word + (' ' * random.randint(1, 10)) for word in lorem_ipsum.split(' '))

一行程序实际上是删除任何前导/尾随空格,并保留一个前导/尾随空格(但只有ONE;-)。

# setup = '''

import re

def while_replace(string):
    while '  ' in string:
        string = string.replace('  ', ' ')

    return string

def re_replace(string):
    return re.sub(r' {2,}' , ' ', string)

def proper_join(string):
    split_string = string.split(' ')

    # To account for leading/trailing spaces that would simply be removed
    beg = ' ' if not split_string[ 0] else ''
    end = ' ' if not split_string[-1] else ''

    # versus simply ' '.join(item for item in string.split(' ') if item)
    return beg + ' '.join(item for item in split_string if item) + end

original_string = """Lorem    ipsum        ... no, really, it kept going...          malesuada enim feugiat.         Integer imperdiet    erat."""

assert while_replace(original_string) == re_replace(original_string) == proper_join(original_string)

#'''

# while_replace_test
new_string = original_string[:]

new_string = while_replace(new_string)

assert new_string != original_string

# re_replace_test
new_string = original_string[:]

new_string = re_replace(new_string)

assert new_string != original_string

# proper_join_test
new_string = original_string[:]

new_string = proper_join(new_string)

assert new_string != original_string

NOTE: The "while version" made a copy of the original_string, as I believe once modified on the first run, successive runs would be faster (if only by a bit). As this adds time, I added this string copy to the other two so that the times showed the difference only in the logic. Keep in mind that the main stmt on timeit instances will only be executed once; the original way I did this, the while loop worked on the same label, original_string, thus the second run, there would be nothing to do. The way it's set up now, calling a function, using two different labels, that isn't a problem. I've added assert statements to all the workers to verify we change something every iteration (for those who may be dubious). E.g., change to this and it breaks:

# while_replace_test
new_string = original_string[:]

new_string = while_replace(new_string)

assert new_string != original_string # will break the 2nd iteration

while '  ' in original_string:
    original_string = original_string.replace('  ', ' ')

Tests run on a laptop with an i5 processor running Windows 7 (64-bit).

timeit.Timer(stmt = test, setup = setup).repeat(7, 1000)

test_string = 'The   fox jumped   over\n\t    the log.' # trivial

Python 2.7.3, 32-bit, Windows
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.001066 |   0.001260 |   0.001128 |   0.001092
     re_replace_test |   0.003074 |   0.003941 |   0.003357 |   0.003349
    proper_join_test |   0.002783 |   0.004829 |   0.003554 |   0.003035

Python 2.7.3, 64-bit, Windows
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.001025 |   0.001079 |   0.001052 |   0.001051
     re_replace_test |   0.003213 |   0.004512 |   0.003656 |   0.003504
    proper_join_test |   0.002760 |   0.006361 |   0.004626 |   0.004600

Python 3.2.3, 32-bit, Windows
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.001350 |   0.002302 |   0.001639 |   0.001357
     re_replace_test |   0.006797 |   0.008107 |   0.007319 |   0.007440
    proper_join_test |   0.002863 |   0.003356 |   0.003026 |   0.002975

Python 3.3.3, 64-bit, Windows
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.001444 |   0.001490 |   0.001460 |   0.001459
     re_replace_test |   0.011771 |   0.012598 |   0.012082 |   0.011910
    proper_join_test |   0.003741 |   0.005933 |   0.004341 |   0.004009

test_string = lorem_ipsum
# Thanks to http://www.lipsum.com/
# "Generated 11 paragraphs, 1000 words, 6665 bytes of Lorem Ipsum"

Python 2.7.3, 32-bit
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.342602 |   0.387803 |   0.359319 |   0.356284
     re_replace_test |   0.337571 |   0.359821 |   0.348876 |   0.348006
    proper_join_test |   0.381654 |   0.395349 |   0.388304 |   0.388193    

Python 2.7.3, 64-bit
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.227471 |   0.268340 |   0.240884 |   0.236776
     re_replace_test |   0.301516 |   0.325730 |   0.308626 |   0.307852
    proper_join_test |   0.358766 |   0.383736 |   0.370958 |   0.371866    

Python 3.2.3, 32-bit
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.438480 |   0.463380 |   0.447953 |   0.446646
     re_replace_test |   0.463729 |   0.490947 |   0.472496 |   0.468778
    proper_join_test |   0.397022 |   0.427817 |   0.406612 |   0.402053    

Python 3.3.3, 64-bit
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.284495 |   0.294025 |   0.288735 |   0.289153
     re_replace_test |   0.501351 |   0.525673 |   0.511347 |   0.508467
    proper_join_test |   0.422011 |   0.448736 |   0.436196 |   0.440318

对于普通字符串,while循环似乎是最快的,其次是Pythonic字符串分割/连接,regex在后面拉。

对于非平凡的字符串,似乎有更多的考虑。32位2.7 ?它是正则表达式的救星!2.7 64位?while循环是最好的。32位3.2,使用“适当的”连接。64位3.3,执行while循环。一次。

最后,如果/何地/何时需要,就可以提高性能,但最好记住这句咒语:

让它起作用 纠正错误 快一点

IANAL, ymv

另一个选择:

>>> import re
>>> str = 'this is a            string with    multiple spaces and    tabs'
>>> str = re.sub('[ \t]+' , ' ', str)
>>> print str
this is a string with multiple spaces and tabs