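I have a string like this:
this is "a test"
I'm trying to write something in Python that splits it on spaces while ignoring spaces inside quotes. The result I'm after is:
['this', 'is', 'a test']
PS: I know you're going to ask "what happens if there are quotes inside the quotes?" In my application, that will never happen.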
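Current answer
Since this question is tagged with regex, I decided to try a regex approach. I first replace every space inside the quoted parts with \x00, then split on whitespace, then replace \x00 with a space again in each part.
Both versions do the same thing, but splitter is more readable than splitter2.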
import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        # Hide the spaces inside a quoted section behind a NUL placeholder.
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    # Restore the hidden spaces in each part.
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))
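Both versions keep the surrounding quotes on the quoted parts, while the question asks for 'a test' without them. Here is a minimal sketch of one way to drop them, assuming (as the question states) that quotes are never nested and that \x00 never occurs in the input. The name splitter_unquoted is made up for illustration and is not part of the original answer.

```python
import re

def splitter_unquoted(s):
    # Hypothetical variant of splitter(): same \x00 placeholder trick,
    # then strip the double quotes left at the ends of each part.
    hidden = re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s)
    return [p.replace("\x00", " ").strip('"') for p in hidden.split()]

print(splitter_unquoted('this is "a test" some text "another test"'))
# ['this', 'is', 'a test', 'some', 'text', 'another test']
```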
Other answers
If you don't care about substrings:
>>> 'a short sized string with spaces '.split()
Performance:
>>> import timeit
>>> s = "('a short sized string with spaces '*100).split()"
>>> t = timeit.Timer(stmt=s)
>>> print("%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000))
171.39 usec/pass
Or the string module (Python 2 only; string.split was removed in Python 3):
>>> from string import split as stringsplit
>>> stringsplit('a short sized string with spaces '*100)
Performance: in this measurement, the string module appears slightly faster than the string method:
>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print("%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000))
154.88 usec/pass
Or you can use the re engine:
>>> from re import split as resplit
>>> regex = r'\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)
Performance:
>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, r"from re import split as resplit; regex=r'\s+'; medstring='a short sized string with spaces '*100")
>>> print("%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000))
540.21 usec/pass
For very long strings, you shouldn't load the whole string into memory; instead, split it up line by line or use an iterative loop.
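The line-by-line idea could look roughly like the sketch below. It assumes the text arrives as an iterable of lines (an open file, for instance); the name iter_words is hypothetical, and io.StringIO only stands in for a real file.

```python
import io

def iter_words(lines):
    # `lines` can be any iterable of strings, e.g. an open file object,
    # so only one line needs to be in memory at a time.
    for line in lines:
        for word in line.split():
            yield word

# io.StringIO stands in for a real file here.
fake_file = io.StringIO("a short sized\nstring with spaces\n")
for word in iter_words(fake_file):
    print(word)
```

Because it is a generator, the words are produced on demand rather than collected into one big list.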
Depending on your use case, you may also want to check out the csv module:
import csv

lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)
Output:
['this', 'is', 'a string']
['and', 'more', 'stuff']
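One thing to watch for with this approach: with delimiter=" ", consecutive spaces produce empty fields. Below is a small sketch for a single string that drops those empty fields. The helper name csv_split is hypothetical, and dropping empty fields also discards a deliberately empty quoted field (""), which may or may not be what you want.

```python
import csv

def csv_split(s):
    # csv.reader expects an iterable of lines, so wrap the string in a list.
    row = next(csv.reader([s], delimiter=" "))
    # Repeated spaces show up as empty strings; filter them out.
    return [field for field in row if field]

print(csv_split('this  is "a test"'))
# ['this', 'is', 'a test']
```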
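Performance-wise, re seems to be faster. Here is my solution using a non-greedy operator, which preserves the outer quotes:
re.findall(r'(?:".*?"|\S)+', s)
Result:
['this', 'is', '"a test"']
It keeps constructs like aaa"bla blub"bbb together, because these tokens are not separated by whitespace. If the string contains escaped characters, you can match like this:
>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall(r'(?:".*?[^\\]"|\S)+', a): print(i)
...
She
said
"He said, \"My name is Mark.\""
Note that this also matches the empty string "" by means of the \S part of the pattern.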
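To make the two behaviours concrete, here is a short sketch that wraps the first pattern in a function and runs it on a plain input and on a glued-together token. The function name tokens and the second sample string are made up for this illustration.

```python
import re

def tokens(s):
    # The non-greedy pattern from this answer, written as a raw string.
    return re.findall(r'(?:".*?"|\S)+', s)

print(tokens('this is "a test"'))
# ['this', 'is', '"a test"']
print(tokens('x aaa"bla blub"bbb y'))
# ['x', 'aaa"bla blub"bbb', 'y']
```

If the outer quotes are not wanted, calling strip('"') on each token afterwards is one option.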
Speed test of the different answers:
import re
import shlex
import csv
line = 'this is "a test"'
%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop
%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop
%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop
%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
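Keep in mind that these four expressions do not return the same thing, so the timings are not comparing identical work. A quick sketch to see the differences, using the same line as in the timings (the results dictionary is only there for printing):

```python
import csv
import re
import shlex

line = 'this is "a test"'

results = {
    "re.split": [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()],
    "re.findall": re.findall(r'[^"\s]\S*|".+?"', line),
    "csv.reader": list(csv.reader([line], delimiter=" ")),
    "shlex.split": shlex.split(line),
}
for name, value in results.items():
    print(name, value)
# re.split ['this', 'is', '"a test"']
# re.findall ['this', 'is', '"a test"']
# csv.reader [['this', 'is', 'a test']]
# shlex.split ['this', 'is', 'a test']
```

The two regex variants keep the quotes, csv.reader returns a list of rows rather than a flat list, and only shlex.split gives exactly the result asked for in the question.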