在Python中，用空格分隔字符串——保留带引号的子字符串

我有一个这样的字符串:

this is "a test"

我试图在Python中写一些东西，通过空格分割它，同时忽略引号中的空格。我想要的结果是:

['this', 'is', 'a test']

PS，我知道你会问“如果引号中有引号会发生什么，在我的应用程序中，这永远不会发生。

当前回答

试试这个:

  def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
      if not inquotes:
        result.extend(substring.split())
      else:
        result.append(substring)
      inquotes = not inquotes
    return result

一些测试字符串:

'This is "a test"' -> ['This', 'is', 'a test']
'"This is \'a test\'"' -> ["This is 'a test'"]

2008-09-17 04:31:38

其他回答

你需要从内置的shlex模块中分离。

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

这应该是你想要的效果。

如果你想保留引号，那么你可以传递posix=False kwarg。

>>> shlex.split('this is "a test"', posix=False)
['this', 'is', '"a test"']

2008-09-17 04:27:32

看一下shlex模块，特别是shlex.split。

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']

2008-09-17 04:27:59

如果你不关心子字符串

>>> 'a short sized string with spaces '.split()

性能:

>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass

或者字符串模块

>>> from string import split as stringsplit; 
>>> stringsplit('a short sized string with spaces '*100)

性能:String模块的性能似乎比字符串方法更好

>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass

或者你可以使用RE引擎

>>> from re import split as resplit
>>> regex = '\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)

性能

>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass

对于非常长的字符串，您不应该将整个字符串加载到内存中，而是将行分开或使用迭代循环

2008-09-17 05:46:57

不同答案的速度测试:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop

2018-04-12 08:28:50

我看到这里的正则表达式方法看起来很复杂和/或错误。这让我很惊讶，因为正则表达式语法可以很容易地描述“空格或引号包围的东西”，而且大多数正则表达式引擎(包括Python的)都可以在正则表达式上进行拆分。所以如果你要使用正则表达式，为什么不直接说出你的意思呢?：

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

解释:

[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators

不过，Shlex可能提供更多的特性。

2009-02-07 23:17:26

在Python中，用空格分隔字符串——保留带引号的子字符串

推荐文章

最新文章

标签