我有一个字符串列表,我想执行一个自然的字母排序。

例如,下面的列表是自然排序(我想要的):

['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

下面是上面列表的“排序”版本(我使用sorted()得到的):

['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']

我在寻找一个排序函数它的行为和第一个一样。


当前回答

我使用的算法是padzero_with_lower,定义如下:

import re

def padzero_with_lower(s):
    return re.sub(r'\d+', lambda m: m.group(0).rjust(10, '0'), s).lower()

该算法发现:

查找并填充任意长度的数字,直到足够大的长度,例如10 然后,它将字符串转换为小写

下面是一个用法示例:

print(padzero_with_lower('file1.txt'))   # file0000000001.txt
print(padzero_with_lower('file12.txt'))  # file0000000012.txt
print(padzero_with_lower('file23.txt'))  # file0000000023.txt
print(padzero_with_lower('file123.txt')) # file0000000123.txt
print(padzero_with_lower('file301.txt')) # file0000000301.txt
print(padzero_with_lower('Dir2/file15.txt'))  # dir0000000002/file0000000015.txt
print(padzero_with_lower('dir2/file123.txt')) # dir0000000002/file0000000123.txt
print(padzero_with_lower('dir15/file2.txt'))  # dir0000000015/file0000000002.txt
print(padzero_with_lower('Dir15/file15.txt')) # dir0000000015/file0000000015.txt
print(padzero_with_lower('elm0'))  # elm0000000000
print(padzero_with_lower('elm1'))  # elm0000000001
print(padzero_with_lower('Elm2'))  # elm0000000002
print(padzero_with_lower('elm9'))  # elm0000000009
print(padzero_with_lower('elm10')) # elm0000000010
print(padzero_with_lower('Elm11')) # elm0000000011 
print(padzero_with_lower('Elm12')) # elm0000000012
print(padzero_with_lower('elm13')) # elm0000000013

测试了这个函数后,我们现在可以使用它作为我们的键。

lis = ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
lis.sort(key=padzero_with_lower)
print(lis)
# Output: ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

其他回答

现在是一些更优雅的东西(pythonic)

-只要碰一下

有很多实现,虽然有些已经接近,但没有一个能完全捕获现代python所提供的优雅。

使用python测试(3.5.1) 包含了一个额外的列表,以演示当 数字在字符串中间 没有测试,但是,我假设如果您的列表是相当大的,那么事先编译正则表达式会更有效 如果这是一个错误的假设,我相信有人会纠正我

罢工

from re import compile, split    
dre = compile(r'(\d+)')
mylist.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])

完整代码

#!/usr/bin/python3
# coding=utf-8
"""
Natural-Sort Test
"""

from re import compile, split

dre = compile(r'(\d+)')
mylist = ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13', 'elm']
mylist2 = ['e0lm', 'e1lm', 'E2lm', 'e9lm', 'e10lm', 'E12lm', 'e13lm', 'elm', 'e01lm']

mylist.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])
mylist2.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])

print(mylist)  
  # ['elm', 'elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
print(mylist2)  
  # ['e0lm', 'e1lm', 'e01lm', 'E2lm', 'e9lm', 'e10lm', 'E12lm', 'e13lm', 'elm']

使用时注意

从操作系统。路径导入分割 您需要区分导入

灵感来自

Python文档-如何排序 人类排序:自然排序顺序 人的分类 这篇文章的贡献者/评论员和引用帖子

很可能functools.cmp_to_key()与python的sort的底层实现密切相关。此外,cmp参数是遗留的。现代的方法是将输入项转换为支持所需的丰富比较操作的对象。

在CPython 2下。X,即使没有实现各自的富比较操作符,也可以对不同类型的对象排序。在CPython 3下。X,不同类型的对象必须显式地支持比较。参见Python如何比较字符串和int?链接到官方文件。大多数答案都依赖于这种隐含的顺序。切换到Python 3。X将需要一个新的类型来实现和统一数字和字符串之间的比较。

Python 2.7.12 (default, Sep 29 2016, 13:30:34) 
>>> (0,"foo") < ("foo",0)
True  
Python 3.5.2 (default, Oct 14 2016, 12:54:53) 
>>> (0,"foo") < ("foo",0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  TypeError: unorderable types: int() < str()

有三种不同的方法。第一种使用嵌套类来利用Python的Iterable比较算法。第二个函数将这个嵌套展开到单个类中。第三种方法放弃了str的子类化,专注于性能。所有都是有时间的;第二辆快了一倍,第三辆快了近六倍。对str进行子类化并不是必需的,而且首先可能是一个坏主意,但它确实带来了某些便利。

排序字符被复制以强制按大小写排序,并交换大小写以强制小写字母优先排序;这就是“自然排序”的典型定义。我无法决定分组的类型;有些人可能更喜欢以下选项,这也会带来显著的性能优势:

d = lambda s: s.lower()+s.swapcase()

在使用的地方,比较运算符被设置为object的运算符,这样它们就不会被functools. total_orders忽略。

import functools
import itertools


@functools.total_ordering
class NaturalStringA(str):
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , super().__repr__()
            )
    d = lambda c, s: [ c.NaturalStringPart("".join(v))
                        for k,v in
                       itertools.groupby(s, c.isdigit)
                     ]
    d = classmethod(d)
    @functools.total_ordering
    class NaturalStringPart(str):
        d = lambda s: "".join(c.lower()+c.swapcase() for c in s)
        d = staticmethod(d)
        def __lt__(self, other):
            if not isinstance(self, type(other)):
                return NotImplemented
            try:
                return int(self) < int(other)
            except ValueError:
                if self.isdigit():
                    return True
                elif other.isdigit():
                    return False
                else:
                    return self.d(self) < self.d(other)
        def __eq__(self, other):
            if not isinstance(self, type(other)):
                return NotImplemented
            try:
                return int(self) == int(other)
            except ValueError:
                if self.isdigit() or other.isdigit():
                    return False
                else:
                    return self.d(self) == self.d(other)
        __le__ = object.__le__
        __ne__ = object.__ne__
        __gt__ = object.__gt__
        __ge__ = object.__ge__
    def __lt__(self, other):
        return self.d(self) < self.d(other)
    def __eq__(self, other):
        return self.d(self) == self.d(other)
    __le__ = object.__le__
    __ne__ = object.__ne__
    __gt__ = object.__gt__
    __ge__ = object.__ge__
import functools
import itertools


@functools.total_ordering
class NaturalStringB(str):
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , super().__repr__()
            )
    d = lambda s: "".join(c.lower()+c.swapcase() for c in s)
    d = staticmethod(d)
    def __lt__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        groups = map(lambda i: itertools.groupby(i, type(self).isdigit), (self, other))
        zipped = itertools.zip_longest(*groups)
        for s,o in zipped:
            if s is None:
                return True
            if o is None:
                return False
            s_k, s_v = s[0], "".join(s[1])
            o_k, o_v = o[0], "".join(o[1])
            if s_k and o_k:
                s_v, o_v = int(s_v), int(o_v)
                if s_v == o_v:
                    continue
                return s_v < o_v
            elif s_k:
                return True
            elif o_k:
                return False
            else:
                s_v, o_v = self.d(s_v), self.d(o_v)
                if s_v == o_v:
                    continue
                return s_v < o_v
        return False
    def __eq__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        groups = map(lambda i: itertools.groupby(i, type(self).isdigit), (self, other))
        zipped = itertools.zip_longest(*groups)
        for s,o in zipped:
            if s is None or o is None:
                return False
            s_k, s_v = s[0], "".join(s[1])
            o_k, o_v = o[0], "".join(o[1])
            if s_k and o_k:
                s_v, o_v = int(s_v), int(o_v)
                if s_v == o_v:
                    continue
                return False
            elif s_k or o_k:
                return False
            else:
                s_v, o_v = self.d(s_v), self.d(o_v)
                if s_v == o_v:
                    continue
                return False
        return True
    __le__ = object.__le__
    __ne__ = object.__ne__
    __gt__ = object.__gt__
    __ge__ = object.__ge__
import functools
import itertools
import enum


class OrderingType(enum.Enum):
    PerWordSwapCase         = lambda s: s.lower()+s.swapcase()
    PerCharacterSwapCase    = lambda s: "".join(c.lower()+c.swapcase() for c in s)


class NaturalOrdering:
    @classmethod
    def by(cls, ordering):
        def wrapper(string):
            return cls(string, ordering)
        return wrapper
    def __init__(self, string, ordering=OrderingType.PerCharacterSwapCase):
        self.string = string
        self.groups = [ (k,int("".join(v)))
                            if k else
                        (k,ordering("".join(v)))
                            for k,v in
                        itertools.groupby(string, str.isdigit)
                      ]
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , self.string
            )
    def __lesser(self, other, default):
        if not isinstance(self, type(other)):
            return NotImplemented
        for s,o in itertools.zip_longest(self.groups, other.groups):
            if s is None:
                return True
            if o is None:
                return False
            s_k, s_v = s
            o_k, o_v = o
            if s_k and o_k:
                if s_v == o_v:
                    continue
                return s_v < o_v
            elif s_k:
                return True
            elif o_k:
                return False
            else:
                if s_v == o_v:
                    continue
                return s_v < o_v
        return default
    def __lt__(self, other):
        return self.__lesser(other, default=False)
    def __le__(self, other):
        return self.__lesser(other, default=True)
    def __eq__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        for s,o in itertools.zip_longest(self.groups, other.groups):
            if s is None or o is None:
                return False
            s_k, s_v = s
            o_k, o_v = o
            if s_k and o_k:
                if s_v == o_v:
                    continue
                return False
            elif s_k or o_k:
                return False
            else:
                if s_v == o_v:
                    continue
                return False
        return True
    # functools.total_ordering doesn't create single-call wrappers if both
    # __le__ and __lt__ exist, so do it manually.
    def __gt__(self, other):
        op_result = self.__le__(other)
        if op_result is NotImplemented:
            return op_result
        return not op_result
    def __ge__(self, other):
        op_result = self.__lt__(other)
        if op_result is NotImplemented:
            return op_result
        return not op_result
    # __ne__ is the only implied ordering relationship, it automatically
    # delegates to __eq__
>>> import natsort
>>> import timeit
>>> l1 = ['Apple', 'corn', 'apPlE', 'arbour', 'Corn', 'Banana', 'apple', 'banana']
>>> l2 = list(map(str, range(30)))
>>> l3 = ["{} {}".format(x,y) for x in l1 for y in l2]
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalStringA)', number=10000, globals=globals()))
362.4729259099986
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalStringB)', number=10000, globals=globals()))
189.7340817489967
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalOrdering.by(OrderingType.PerCharacterSwapCase))', number=10000, globals=globals()))
69.34636392899847
>>> print(timeit.timeit('natsort.natsorted(l3+["0"], alg=natsort.ns.GROUPLETTERS | natsort.ns.LOWERCASEFIRST)', number=10000, globals=globals()))
98.2531585780016

自然排序既相当复杂,又定义模糊。不要忘记事先运行unicodedata.normalize(…),并考虑使用str.casefold()而不是str.lower()。可能有一些微妙的编码问题我还没有考虑到。因此,我尝试性地推荐natsort库。我快速浏览了一下github存储库;代码维护非常出色。

All the algorithms I've seen depend on tricks such as duplicating and lowering characters, and swapping case. While this doubles the running time, an alternative would require a total natural ordering on the input character set. I don't think this is part of the unicode specification, and since there are many more unicode digits than [0-9], creating such a sorting would be equally daunting. If you want locale-aware comparisons, prepare your strings with locale.strxfrm per Python's Sorting HOW TO.

试试这个:

import re

def natural_sort(l): 
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(l, key=alphanum_key)

输出:

['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

代码改编自这里:排序人类:自然排序顺序。

一个紧凑的解决方案,基于将字符串转换为List[Tuple(str, int)]。

Code

def string_to_pairs(s, pairs=re.compile(r"(\D*)(\d*)").findall):
    return [(text.lower(), int(digits or 0)) for (text, digits) in pairs(s)[:-1]]

示范

sorted(['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9'], key=string_to_pairs)

输出:

['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

测试

转换

assert string_to_pairs("") == []
assert string_to_pairs("123") == [("", 123)]
assert string_to_pairs("abc") == [("abc", 0)]
assert string_to_pairs("123abc") == [("", 123), ("abc", 0)]
assert string_to_pairs("abc123") == [("abc", 123)]
assert string_to_pairs("123abc456") == [("", 123), ("abc", 456)]
assert string_to_pairs("abc123efg") == [("abc", 123), ("efg", 0)]

排序

# Some extracts from the test suite of the natsort library. Permalink:
# https://github.com/SethMMorton/natsort/blob/e3c32f5638bf3a0e9a23633495269bea0e75d379/tests/test_natsorted.py

sort_data = [
    (  # same as test_natsorted_can_sort_as_unsigned_ints_which_is_default()
        ["a50", "a51.", "a50.31", "a-50", "a50.4", "a5.034e1", "a50.300"],
        ["a5.034e1", "a50", "a50.4", "a50.31", "a50.300", "a51.", "a-50"],
    ),
    (  # same as test_natsorted_numbers_in_ascending_order()
        ["a2", "a5", "a9", "a1", "a4", "a10", "a6"],
        ["a1", "a2", "a4", "a5", "a6", "a9", "a10"],
    ),
    (  # same as test_natsorted_can_sort_as_version_numbers()
        ["1.9.9a", "1.11", "1.9.9b", "1.11.4", "1.10.1"],
        ["1.9.9a", "1.9.9b", "1.10.1", "1.11", "1.11.4"],
    ),
    (  # different from test_natsorted_handles_filesystem_paths()
        [
            "/p/Folder (10)/file.tar.gz",
            "/p/Folder (1)/file (1).tar.gz",
            "/p/Folder/file.x1.9.tar.gz",
            "/p/Folder (1)/file.tar.gz",
            "/p/Folder/file.x1.10.tar.gz",
        ],
        [
            "/p/Folder (1)/file (1).tar.gz",
            "/p/Folder (1)/file.tar.gz",
            "/p/Folder (10)/file.tar.gz",
            "/p/Folder/file.x1.9.tar.gz",
            "/p/Folder/file.x1.10.tar.gz",
        ],
    ),
    (  # same as test_natsorted_path_extensions_heuristic()
        [
            "Try.Me.Bug - 09 - One.Two.Three.[text].mkv",
            "Try.Me.Bug - 07 - One.Two.5.[text].mkv",
            "Try.Me.Bug - 08 - One.Two.Three[text].mkv",
        ],
        [
            "Try.Me.Bug - 07 - One.Two.5.[text].mkv",
            "Try.Me.Bug - 08 - One.Two.Three[text].mkv",
            "Try.Me.Bug - 09 - One.Two.Three.[text].mkv",
        ],
    ),
    (  # same as ns.IGNORECASE for test_natsorted_supports_case_handling()
        ["Apple", "corn", "Corn", "Banana", "apple", "banana"],
        ["Apple", "apple", "Banana", "banana", "corn", "Corn"],
    ),

]

for (given, expected) in sort_data:
    assert sorted(given, key=string_to_pairs) == expected

奖金

如果字符串混合了非ascii文本和数字,您可能会对将string_to_pairs()与我在其他地方给出的函数remove_diacritics()组合感兴趣。

>>> import re
>>> sorted(lst, key=lambda x: int(re.findall(r'\d+$', x)[0]))
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']