我有一个字符串列表,我想执行一个自然的字母排序。
例如,下面的列表是自然排序(我想要的):
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
下面是上面列表的“排序”版本(我使用sorted()得到的):
['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']
我在寻找一个排序函数它的行为和第一个一样。
我有一个字符串列表,我想执行一个自然的字母排序。
例如,下面的列表是自然排序(我想要的):
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
下面是上面列表的“排序”版本(我使用sorted()得到的):
['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']
我在寻找一个排序函数它的行为和第一个一样。
当前回答
这是一个更高级的解决方案,由Claudiu和Mark Byers改进:
它使用casefold()而不是lower()来匹配字符串 您可以传递另一个键lambda来选择一个内部元素(就像您习惯使用普通排序函数一样) 它当然适用于列表。Sort, sorted, max,等等。
def natural_sort(key=None, _nsre=re.compile('([0-9]+)')):
return lambda x: [int(text) if text.isdigit() else text.casefold()
for text in _nsre.split(key(x) if key else x)]
使用示例:
# Original solution
data.sort(key=natural_sort)
# Select an additional key
image_files.sort(key=natural_sort(lambda x: x.original_filename))
其他回答
为了记录,下面是Mark Byers的简单解决方案的另一个变体,类似于Walter Tross建议的解决方案,避免调用isdigit()。这不仅使它更快,而且还避免了可能发生的问题,因为与regex \d+相比,isdigit()将更多的unicode字符视为数字。
import re
from itertools import cycle
_re_digits = re.compile(r"(\d+)")
def natural_comparison_key(key):
return tuple(
int(part) if is_digit else part
for part, is_digit in zip(_re_digits.split(key), cycle((False, True)))
)
试试这个:
import re
def natural_sort(l):
convert = lambda text: int(text) if text.isdigit() else text.lower()
alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
return sorted(l, key=alphanum_key)
输出:
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
代码改编自这里:排序人类:自然排序顺序。
让我就这一需求提出自己的看法:
from typing import Tuple, Union, Optional, Generator
StrOrInt = Union[str, int]
# On Python 3.6, string concatenation is REALLY fast
# Tested myself, and this fella also tested:
# https://blog.ganssle.io/articles/2019/11/string-concat.html
def griter(s: str) -> Generator[StrOrInt, None, None]:
last_was_digit: Optional[bool] = None
cluster: str = ""
for c in s:
if last_was_digit is None:
last_was_digit = c.isdigit()
cluster += c
continue
if c.isdigit() != last_was_digit:
if last_was_digit:
yield int(cluster)
else:
yield cluster
last_was_digit = c.isdigit()
cluster = ""
cluster += c
if last_was_digit:
yield int(cluster)
else:
yield cluster
return
def grouper(s: str) -> Tuple[StrOrInt, ...]:
return tuple(griter(s))
现在如果我们有这样的列表:
filelist = [
'File3', 'File007', 'File3a', 'File10', 'File11', 'File1', 'File4', 'File5',
'File9', 'File8', 'File8b1', 'File8b2', 'File8b11', 'File6'
]
我们可以简单地使用key= kwarg来进行自然排序:
>>> sorted(filelist, key=grouper)
['File1', 'File3', 'File3a', 'File4', 'File5', 'File6', 'File007', 'File8',
'File8b1', 'File8b2', 'File8b11', 'File9', 'File10', 'File11']
当然,这里的缺点是,就像现在一样,该函数将对大写字母在小写字母之前进行排序。
我将把不区分大小写的grouper的实现留给读者:-)
下面是马克·拜尔斯的另一个版本的回答。这个版本演示了如何传入一个属性名,该属性名将用于计算列表中的对象。
def natural_sort(l, attrib):
convert = lambda text: int(text) if text.isdigit() else text.lower()
alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key.__dict__[attrib])]
return sorted(l, key=alphanum_key)
results = natural_sort(albums, 'albumid')
其中albums是一个Album实例列表,albumid是一个字符串属性,名义上包含数字。
很可能functools.cmp_to_key()与python的sort的底层实现密切相关。此外,cmp参数是遗留的。现代的方法是将输入项转换为支持所需的丰富比较操作的对象。
在CPython 2下。X,即使没有实现各自的富比较操作符,也可以对不同类型的对象排序。在CPython 3下。X,不同类型的对象必须显式地支持比较。参见Python如何比较字符串和int?链接到官方文件。大多数答案都依赖于这种隐含的顺序。切换到Python 3。X将需要一个新的类型来实现和统一数字和字符串之间的比较。
Python 2.7.12 (default, Sep 29 2016, 13:30:34)
>>> (0,"foo") < ("foo",0)
True
Python 3.5.2 (default, Oct 14 2016, 12:54:53)
>>> (0,"foo") < ("foo",0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unorderable types: int() < str()
有三种不同的方法。第一种使用嵌套类来利用Python的Iterable比较算法。第二个函数将这个嵌套展开到单个类中。第三种方法放弃了str的子类化,专注于性能。所有都是有时间的;第二辆快了一倍,第三辆快了近六倍。对str进行子类化并不是必需的,而且首先可能是一个坏主意,但它确实带来了某些便利。
排序字符被复制以强制按大小写排序,并交换大小写以强制小写字母优先排序;这就是“自然排序”的典型定义。我无法决定分组的类型;有些人可能更喜欢以下选项,这也会带来显著的性能优势:
d = lambda s: s.lower()+s.swapcase()
在使用的地方,比较运算符被设置为object的运算符,这样它们就不会被functools. total_orders忽略。
import functools
import itertools
@functools.total_ordering
class NaturalStringA(str):
def __repr__(self):
return "{}({})".format\
( type(self).__name__
, super().__repr__()
)
d = lambda c, s: [ c.NaturalStringPart("".join(v))
for k,v in
itertools.groupby(s, c.isdigit)
]
d = classmethod(d)
@functools.total_ordering
class NaturalStringPart(str):
d = lambda s: "".join(c.lower()+c.swapcase() for c in s)
d = staticmethod(d)
def __lt__(self, other):
if not isinstance(self, type(other)):
return NotImplemented
try:
return int(self) < int(other)
except ValueError:
if self.isdigit():
return True
elif other.isdigit():
return False
else:
return self.d(self) < self.d(other)
def __eq__(self, other):
if not isinstance(self, type(other)):
return NotImplemented
try:
return int(self) == int(other)
except ValueError:
if self.isdigit() or other.isdigit():
return False
else:
return self.d(self) == self.d(other)
__le__ = object.__le__
__ne__ = object.__ne__
__gt__ = object.__gt__
__ge__ = object.__ge__
def __lt__(self, other):
return self.d(self) < self.d(other)
def __eq__(self, other):
return self.d(self) == self.d(other)
__le__ = object.__le__
__ne__ = object.__ne__
__gt__ = object.__gt__
__ge__ = object.__ge__
import functools
import itertools
@functools.total_ordering
class NaturalStringB(str):
def __repr__(self):
return "{}({})".format\
( type(self).__name__
, super().__repr__()
)
d = lambda s: "".join(c.lower()+c.swapcase() for c in s)
d = staticmethod(d)
def __lt__(self, other):
if not isinstance(self, type(other)):
return NotImplemented
groups = map(lambda i: itertools.groupby(i, type(self).isdigit), (self, other))
zipped = itertools.zip_longest(*groups)
for s,o in zipped:
if s is None:
return True
if o is None:
return False
s_k, s_v = s[0], "".join(s[1])
o_k, o_v = o[0], "".join(o[1])
if s_k and o_k:
s_v, o_v = int(s_v), int(o_v)
if s_v == o_v:
continue
return s_v < o_v
elif s_k:
return True
elif o_k:
return False
else:
s_v, o_v = self.d(s_v), self.d(o_v)
if s_v == o_v:
continue
return s_v < o_v
return False
def __eq__(self, other):
if not isinstance(self, type(other)):
return NotImplemented
groups = map(lambda i: itertools.groupby(i, type(self).isdigit), (self, other))
zipped = itertools.zip_longest(*groups)
for s,o in zipped:
if s is None or o is None:
return False
s_k, s_v = s[0], "".join(s[1])
o_k, o_v = o[0], "".join(o[1])
if s_k and o_k:
s_v, o_v = int(s_v), int(o_v)
if s_v == o_v:
continue
return False
elif s_k or o_k:
return False
else:
s_v, o_v = self.d(s_v), self.d(o_v)
if s_v == o_v:
continue
return False
return True
__le__ = object.__le__
__ne__ = object.__ne__
__gt__ = object.__gt__
__ge__ = object.__ge__
import functools
import itertools
import enum
class OrderingType(enum.Enum):
PerWordSwapCase = lambda s: s.lower()+s.swapcase()
PerCharacterSwapCase = lambda s: "".join(c.lower()+c.swapcase() for c in s)
class NaturalOrdering:
@classmethod
def by(cls, ordering):
def wrapper(string):
return cls(string, ordering)
return wrapper
def __init__(self, string, ordering=OrderingType.PerCharacterSwapCase):
self.string = string
self.groups = [ (k,int("".join(v)))
if k else
(k,ordering("".join(v)))
for k,v in
itertools.groupby(string, str.isdigit)
]
def __repr__(self):
return "{}({})".format\
( type(self).__name__
, self.string
)
def __lesser(self, other, default):
if not isinstance(self, type(other)):
return NotImplemented
for s,o in itertools.zip_longest(self.groups, other.groups):
if s is None:
return True
if o is None:
return False
s_k, s_v = s
o_k, o_v = o
if s_k and o_k:
if s_v == o_v:
continue
return s_v < o_v
elif s_k:
return True
elif o_k:
return False
else:
if s_v == o_v:
continue
return s_v < o_v
return default
def __lt__(self, other):
return self.__lesser(other, default=False)
def __le__(self, other):
return self.__lesser(other, default=True)
def __eq__(self, other):
if not isinstance(self, type(other)):
return NotImplemented
for s,o in itertools.zip_longest(self.groups, other.groups):
if s is None or o is None:
return False
s_k, s_v = s
o_k, o_v = o
if s_k and o_k:
if s_v == o_v:
continue
return False
elif s_k or o_k:
return False
else:
if s_v == o_v:
continue
return False
return True
# functools.total_ordering doesn't create single-call wrappers if both
# __le__ and __lt__ exist, so do it manually.
def __gt__(self, other):
op_result = self.__le__(other)
if op_result is NotImplemented:
return op_result
return not op_result
def __ge__(self, other):
op_result = self.__lt__(other)
if op_result is NotImplemented:
return op_result
return not op_result
# __ne__ is the only implied ordering relationship, it automatically
# delegates to __eq__
>>> import natsort
>>> import timeit
>>> l1 = ['Apple', 'corn', 'apPlE', 'arbour', 'Corn', 'Banana', 'apple', 'banana']
>>> l2 = list(map(str, range(30)))
>>> l3 = ["{} {}".format(x,y) for x in l1 for y in l2]
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalStringA)', number=10000, globals=globals()))
362.4729259099986
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalStringB)', number=10000, globals=globals()))
189.7340817489967
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalOrdering.by(OrderingType.PerCharacterSwapCase))', number=10000, globals=globals()))
69.34636392899847
>>> print(timeit.timeit('natsort.natsorted(l3+["0"], alg=natsort.ns.GROUPLETTERS | natsort.ns.LOWERCASEFIRST)', number=10000, globals=globals()))
98.2531585780016
自然排序既相当复杂,又定义模糊。不要忘记事先运行unicodedata.normalize(…),并考虑使用str.casefold()而不是str.lower()。可能有一些微妙的编码问题我还没有考虑到。因此,我尝试性地推荐natsort库。我快速浏览了一下github存储库;代码维护非常出色。
All the algorithms I've seen depend on tricks such as duplicating and lowering characters, and swapping case. While this doubles the running time, an alternative would require a total natural ordering on the input character set. I don't think this is part of the unicode specification, and since there are many more unicode digits than [0-9], creating such a sorting would be equally daunting. If you want locale-aware comparisons, prepare your strings with locale.strxfrm per Python's Sorting HOW TO.