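I have a string like this:
this is "a test"
I'm trying to write something in Python that splits it on spaces while ignoring spaces inside quotes. The result I want is:
['this', 'is', 'a test']
PS: I know you are going to ask "what happens if there are quotes within the quotes?" In my application, that will never happen.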
Current answer
As an option, try tssplit:
In [1]: from tssplit import tssplit
In [2]: tssplit('this is "a test"', quote='"', delimiter='')
Out[2]: ['this', 'is', 'a test']
Other answers
If you don't care about quoted substrings:
>>> 'a short sized string with spaces '.split()
Performance:
>>> import timeit
>>> s = "('a short sized string with spaces '*100).split()"
>>> t = timeit.Timer(stmt=s)
>>> print("%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000))
171.39 usec/pass
Or the string module (Python 2 only; string.split was removed in Python 3):
>>> from string import split as stringsplit
>>> stringsplit('a short sized string with spaces '*100)
Performance: the string module seems to perform slightly better than the string method
>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print("%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000))
154.88 usec/pass
Or you can use the RE engine:
>>> from re import split as resplit
>>> regex = r'\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)
Performance:
>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, r"from re import split as resplit; regex=r'\s+'; medstring='a short sized string with spaces '*100")
>>> print("%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000))
540.21 usec/pass
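The timings above were taken on Python 2. As a rough sketch of how the same comparison could be repeated on Python 3 (the helper name bench and the smaller pass count are my own choices, not from the answer, and the numbers will differ per machine):

```python
import re
import timeit

text = 'a short sized string with spaces ' * 100
pattern = re.compile(r'\s+')  # precompiled so only the split itself is timed


def bench(func, number=1000):
    """Return the average time per call in microseconds (hypothetical helper)."""
    total = timeit.timeit(func, number=number)
    return 1000000 * total / number


str_usec = bench(lambda: text.split())
re_usec = bench(lambda: pattern.split(text))

print("str.split: %.2f usec/pass" % str_usec)
print("re.split:  %.2f usec/pass" % re_usec)
```

Note that re.split leaves a trailing empty string when the text ends in whitespace, while str.split() does not, so the two results are not strictly identical.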
For very long strings, you should not load the entire string into memory; instead, process it line by line or use an iterative loop.
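One way to read that advice is a generator that splits one line at a time. This is only a sketch: iter_tokens is a made-up name, and io.StringIO stands in for whatever file or stream you are really reading.

```python
import io


def iter_tokens(lines):
    """Yield whitespace-separated tokens one line at a time.

    Only the current line is held in memory, never the whole input.
    """
    for line in lines:
        for token in line.split():
            yield token


# StringIO is a stand-in for a real file object opened with open(...)
sample = io.StringIO('a short sized\nstring with spaces\n')
tokens = list(iter_tokens(sample))
print(tokens)
```

In real use you would loop over iter_tokens(f) directly instead of building a list, since the list brings everything back into memory.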
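The regex approaches I see here look complicated and/or wrong. That surprises me, because regex syntax can easily describe "whitespace or a thing surrounded by quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?
import re
test = 'this is "a test"' # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]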
Explanation:
[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators
However, shlex probably provides more features.
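For comparison, a minimal shlex sketch for the string in the question (standard library only; the variable names are mine). By default shlex.split uses POSIX shell rules, so the quotes are removed; passing posix=False keeps them.

```python
import shlex

test = 'this is "a test"'

# Default POSIX mode: quotes group words and are then stripped
pieces = shlex.split(test)
print(pieces)  # ['this', 'is', 'a test']

# Non-POSIX mode: quotes still group words but stay in the token
pieces_quoted = shlex.split(test, posix=False)
print(pieces_quoted)
```

Keep in mind that shlex also interprets backslash escapes and single quotes, which may or may not be what you want for non-shell input.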
To preserve the quotes, use this function:
def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args
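If you want the quotes dropped instead (as in the question's expected output), the same character loop can be adapted. A minimal sketch, assuming only double quotes matter and they never nest; split_unquoted is a hypothetical name, not part of the answer above. Unlike getArgs, it also skips runs of consecutive spaces rather than emitting empty strings for them.

```python
def split_unquoted(s):
    """Split on whitespace outside double quotes and drop the quotes."""
    args = []
    cur = []
    in_quotes = False
    has_token = False  # True once the current token has started (even if empty, e.g. "")
    for ch in s:
        if ch == '"':
            # Toggle quote state; the quote character itself is not kept
            in_quotes = not in_quotes
            has_token = True
        elif ch.isspace() and not in_quotes:
            # Whitespace outside quotes ends the current token, if any
            if has_token:
                args.append(''.join(cur))
                cur = []
                has_token = False
        else:
            cur.append(ch)
            has_token = True
    if has_token:
        args.append(''.join(cur))
    return args


result = split_unquoted('this is "a test"')
print(result)  # ['this', 'is', 'a test']
```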
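Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces inside the quoted parts with \x00, then split on whitespace, then replace the \x00 back to a space in each part. Note that the quotes themselves are kept in the resulting parts.
Both versions do the same thing, but splitter is a bit more readable than splitter2.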
import re
s = 'this is "a test" some text "another test"'
def splitter(s):
    def replacer(m):
        # hide spaces inside a quoted part so split() leaves it intact
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts
def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]
print(splitter2(s))
To work around the unicode problems in some Python 2 versions, I suggest:
from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]