Split a string by spaces in Python, preserving quoted substrings

I have a string like this:

this is "a test"

I'm trying to write something in Python that splits it by spaces while ignoring spaces inside quotes. The result I'm looking for is:

['this', 'is', 'a test']

PS: I know you are going to ask "what happens if there are quotes within the quotes?" In my application, that will never happen.

Current answer

Take a look at the shlex module, specifically shlex.split.

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']
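
As a small sketch of how shlex.split behaves beyond the one-liner above: by default it runs in POSIX mode and removes the quotes, while posix=False keeps them in the token. The variable names here are only illustrative, and the unbalanced-quote case is included because it raises rather than returning a partial result.

```python
import shlex

text = 'this is "a test"'

# Default POSIX mode: quotes group words and are then removed.
posix_tokens = shlex.split(text)

# Non-POSIX mode: quotes still group words but stay in the token.
raw_tokens = shlex.split(text, posix=False)

# Single quotes are honoured as well in POSIX mode.
single_tokens = shlex.split("this is 'a test'")

# An unbalanced quote raises ValueError instead of guessing.
try:
    shlex.split('this is "a test')
    unbalanced_error = None
except ValueError as exc:
    unbalanced_error = str(exc)

print(posix_tokens)    # ['this', 'is', 'a test']
print(raw_tokens)      # ['this', 'is', '"a test"']
print(single_tokens)   # ['this', 'is', 'a test']
print(unbalanced_error)
```

If the input can come from users, it is probably worth catching ValueError around the call.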

2008-09-17 04:27:59

Other answers

To preserve the quotes, use this function:

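def getArgs(s):
    """Split s on spaces, keeping double-quoted parts (quotes included) together."""
    args = []
    cur = ''
    inQuotes = False
    for char in s.strip():
        if char == ' ' and not inQuotes:
            # An unquoted space ends the current token; skip the empty
            # tokens that runs of consecutive spaces would produce.
            if cur:
                args.append(cur)
            cur = ''
        elif char == '"':
            # Toggle the quoted state and keep the quote character.
            inQuotes = not inQuotes
            cur += char
        else:
            cur += char
    if cur:
        args.append(cur)
    return args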
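
The function above keeps the quote characters in the result. For the output the question actually asks for, the same character-by-character approach can drop them instead. This is only a sketch: split_quoted is a hypothetical name, and it assumes double quotes only, with no escaping and no nested quotes, as the question states.

```python
def split_quoted(s):
    """Split on whitespace, treating double-quoted runs as one token (quotes removed)."""
    args = []
    cur = []            # characters of the token being built
    in_quotes = False
    has_token = False   # True once the current token has started (lets "" yield '')
    for ch in s:
        if ch == '"':
            # Toggle the quoted state; the quote itself is not stored.
            in_quotes = not in_quotes
            has_token = True
        elif ch.isspace() and not in_quotes:
            # Unquoted whitespace closes the current token, if there is one.
            if has_token:
                args.append(''.join(cur))
                cur = []
                has_token = False
        else:
            cur.append(ch)
            has_token = True
    if has_token:
        args.append(''.join(cur))
    return args


print(split_quoted('this is "a test"'))  # ['this', 'is', 'a test']
```

Runs of spaces outside quotes are collapsed, and an empty pair of quotes produces an empty string token.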

2017-03-26 23:08:09

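Since this question is tagged with regex, I decided to try a regex approach. I first replace all spaces inside the quoted parts with \x00, then split on whitespace, then replace each \x00 back with a space in every part.

Both versions do the same thing, but splitter is more readable than splitter2.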

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))

2008-09-17 06:08:38

Speed test of the different answers:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
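
The %timeit lines above only work inside IPython. A rough equivalent with the standard timeit module might look like the sketch below; the loop counts are arbitrary, the dictionary name candidates is only illustrative, and the absolute numbers will differ by machine. It also prints each result, since the candidates do not all return the same thing.

```python
import csv
import re
import shlex
import timeit

line = 'this is "a test"'

# The four one-liners from the timings above, wrapped as callables.
candidates = {
    "re.split": lambda: [p for p in re.split(r"""( |".*?"|'.*?')""", line) if p.strip()],
    "re.findall": lambda: re.findall(r'[^"\s]\S*|".+?"', line),
    "csv.reader": lambda: list(csv.reader([line], delimiter=" ")),
    "shlex.split": lambda: shlex.split(line),
}

# Run each candidate once to record what it actually returns.
results = {name: fn() for name, fn in candidates.items()}

NUMBER = 2000  # calls per timing run (arbitrary choice)
for name, fn in candidates.items():
    # Best of three runs, converted to microseconds per call.
    best = min(timeit.repeat(fn, number=NUMBER, repeat=3))
    print(f"{name:12} {best / NUMBER * 1e6:8.2f} us per call -> {results[name]!r}")
```

Note that the two regex versions keep the quote characters, csv.reader returns a list of rows rather than a flat list, and only shlex.split gives exactly the list the question asks for, so the timings are not comparing identical work.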

2018-04-12 08:28:50

As an option, try tssplit:

In [1]: from tssplit import tssplit
In [2]: tssplit('this is "a test"', quote='"', delimiter='')
Out[2]: ['this', 'is', 'a test']

2020-03-30 11:49:43

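The regex approaches I see here look complicated and/or wrong. That surprises me, because regex syntax can easily describe "whitespace or a thing surrounded by quotes", and most regex engines (including Python's) can split on a regex. So if you are going to use regexes, why not just say exactly what you mean?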

import re

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

Explanation:

[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators

However, shlex probably provides more features.
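
To make the role of the capturing group and the .strip() filter more concrete, here is a small sketch that prints the intermediate list re.split returns before filtering. The pattern is the same one as above, written as a raw triple-quoted string so the backslashes are not needed; the variable names are only illustrative.

```python
import re

test = 'this is "a test"'

# Same pattern as above: a space, a double-quoted run, or a single-quoted run.
pattern = r"""( |".*?"|'.*?')"""

# Because the pattern is wrapped in a capturing group, re.split returns the
# separators too, so the quoted run survives as one of the "separators".
raw_pieces = re.split(pattern, test)
print(raw_pieces)   # ['this', ' ', 'is', ' ', '', '"a test"', '']

# Dropping blank and whitespace-only entries leaves the tokens.
pieces = [p for p in raw_pieces if p.strip()]
print(pieces)       # ['this', 'is', '"a test"']

# The quotes are still attached; strip them if they are not wanted.
unquoted = [p.strip('"\'') for p in pieces]
print(unquoted)     # ['this', 'is', 'a test']
```

Stripping the quotes afterwards assumes tokens never legitimately begin or end with a quote character, which matches the question's stated constraints.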

2009-02-07 23:17:26
