我有一个字符串列表,我想执行一个自然的字母排序。

例如,下面的列表是自然排序(我想要的):

['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

下面是上面列表的“排序”版本(我使用sorted()得到的):

['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']

我在寻找一个排序函数它的行为和第一个一样。


当前回答

很可能functools.cmp_to_key()与python的sort的底层实现密切相关。此外,cmp参数是遗留的。现代的方法是将输入项转换为支持所需的丰富比较操作的对象。

在CPython 2下。X,即使没有实现各自的富比较操作符,也可以对不同类型的对象排序。在CPython 3下。X,不同类型的对象必须显式地支持比较。参见Python如何比较字符串和int?链接到官方文件。大多数答案都依赖于这种隐含的顺序。切换到Python 3。X将需要一个新的类型来实现和统一数字和字符串之间的比较。

Python 2.7.12 (default, Sep 29 2016, 13:30:34) 
>>> (0,"foo") < ("foo",0)
True  
Python 3.5.2 (default, Oct 14 2016, 12:54:53) 
>>> (0,"foo") < ("foo",0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  TypeError: unorderable types: int() < str()

有三种不同的方法。第一种使用嵌套类来利用Python的Iterable比较算法。第二个函数将这个嵌套展开到单个类中。第三种方法放弃了str的子类化,专注于性能。所有都是有时间的;第二辆快了一倍,第三辆快了近六倍。对str进行子类化并不是必需的,而且首先可能是一个坏主意,但它确实带来了某些便利。

排序字符被复制以强制按大小写排序,并交换大小写以强制小写字母优先排序;这就是“自然排序”的典型定义。我无法决定分组的类型;有些人可能更喜欢以下选项,这也会带来显著的性能优势:

d = lambda s: s.lower()+s.swapcase()

在使用的地方,比较运算符被设置为object的运算符,这样它们就不会被functools. total_orders忽略。

import functools
import itertools


@functools.total_ordering
class NaturalStringA(str):
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , super().__repr__()
            )
    d = lambda c, s: [ c.NaturalStringPart("".join(v))
                        for k,v in
                       itertools.groupby(s, c.isdigit)
                     ]
    d = classmethod(d)
    @functools.total_ordering
    class NaturalStringPart(str):
        d = lambda s: "".join(c.lower()+c.swapcase() for c in s)
        d = staticmethod(d)
        def __lt__(self, other):
            if not isinstance(self, type(other)):
                return NotImplemented
            try:
                return int(self) < int(other)
            except ValueError:
                if self.isdigit():
                    return True
                elif other.isdigit():
                    return False
                else:
                    return self.d(self) < self.d(other)
        def __eq__(self, other):
            if not isinstance(self, type(other)):
                return NotImplemented
            try:
                return int(self) == int(other)
            except ValueError:
                if self.isdigit() or other.isdigit():
                    return False
                else:
                    return self.d(self) == self.d(other)
        __le__ = object.__le__
        __ne__ = object.__ne__
        __gt__ = object.__gt__
        __ge__ = object.__ge__
    def __lt__(self, other):
        return self.d(self) < self.d(other)
    def __eq__(self, other):
        return self.d(self) == self.d(other)
    __le__ = object.__le__
    __ne__ = object.__ne__
    __gt__ = object.__gt__
    __ge__ = object.__ge__
import functools
import itertools


@functools.total_ordering
class NaturalStringB(str):
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , super().__repr__()
            )
    d = lambda s: "".join(c.lower()+c.swapcase() for c in s)
    d = staticmethod(d)
    def __lt__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        groups = map(lambda i: itertools.groupby(i, type(self).isdigit), (self, other))
        zipped = itertools.zip_longest(*groups)
        for s,o in zipped:
            if s is None:
                return True
            if o is None:
                return False
            s_k, s_v = s[0], "".join(s[1])
            o_k, o_v = o[0], "".join(o[1])
            if s_k and o_k:
                s_v, o_v = int(s_v), int(o_v)
                if s_v == o_v:
                    continue
                return s_v < o_v
            elif s_k:
                return True
            elif o_k:
                return False
            else:
                s_v, o_v = self.d(s_v), self.d(o_v)
                if s_v == o_v:
                    continue
                return s_v < o_v
        return False
    def __eq__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        groups = map(lambda i: itertools.groupby(i, type(self).isdigit), (self, other))
        zipped = itertools.zip_longest(*groups)
        for s,o in zipped:
            if s is None or o is None:
                return False
            s_k, s_v = s[0], "".join(s[1])
            o_k, o_v = o[0], "".join(o[1])
            if s_k and o_k:
                s_v, o_v = int(s_v), int(o_v)
                if s_v == o_v:
                    continue
                return False
            elif s_k or o_k:
                return False
            else:
                s_v, o_v = self.d(s_v), self.d(o_v)
                if s_v == o_v:
                    continue
                return False
        return True
    __le__ = object.__le__
    __ne__ = object.__ne__
    __gt__ = object.__gt__
    __ge__ = object.__ge__
import functools
import itertools
import enum


class OrderingType(enum.Enum):
    PerWordSwapCase         = lambda s: s.lower()+s.swapcase()
    PerCharacterSwapCase    = lambda s: "".join(c.lower()+c.swapcase() for c in s)


class NaturalOrdering:
    @classmethod
    def by(cls, ordering):
        def wrapper(string):
            return cls(string, ordering)
        return wrapper
    def __init__(self, string, ordering=OrderingType.PerCharacterSwapCase):
        self.string = string
        self.groups = [ (k,int("".join(v)))
                            if k else
                        (k,ordering("".join(v)))
                            for k,v in
                        itertools.groupby(string, str.isdigit)
                      ]
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , self.string
            )
    def __lesser(self, other, default):
        if not isinstance(self, type(other)):
            return NotImplemented
        for s,o in itertools.zip_longest(self.groups, other.groups):
            if s is None:
                return True
            if o is None:
                return False
            s_k, s_v = s
            o_k, o_v = o
            if s_k and o_k:
                if s_v == o_v:
                    continue
                return s_v < o_v
            elif s_k:
                return True
            elif o_k:
                return False
            else:
                if s_v == o_v:
                    continue
                return s_v < o_v
        return default
    def __lt__(self, other):
        return self.__lesser(other, default=False)
    def __le__(self, other):
        return self.__lesser(other, default=True)
    def __eq__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        for s,o in itertools.zip_longest(self.groups, other.groups):
            if s is None or o is None:
                return False
            s_k, s_v = s
            o_k, o_v = o
            if s_k and o_k:
                if s_v == o_v:
                    continue
                return False
            elif s_k or o_k:
                return False
            else:
                if s_v == o_v:
                    continue
                return False
        return True
    # functools.total_ordering doesn't create single-call wrappers if both
    # __le__ and __lt__ exist, so do it manually.
    def __gt__(self, other):
        op_result = self.__le__(other)
        if op_result is NotImplemented:
            return op_result
        return not op_result
    def __ge__(self, other):
        op_result = self.__lt__(other)
        if op_result is NotImplemented:
            return op_result
        return not op_result
    # __ne__ is the only implied ordering relationship, it automatically
    # delegates to __eq__
>>> import natsort
>>> import timeit
>>> l1 = ['Apple', 'corn', 'apPlE', 'arbour', 'Corn', 'Banana', 'apple', 'banana']
>>> l2 = list(map(str, range(30)))
>>> l3 = ["{} {}".format(x,y) for x in l1 for y in l2]
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalStringA)', number=10000, globals=globals()))
362.4729259099986
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalStringB)', number=10000, globals=globals()))
189.7340817489967
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalOrdering.by(OrderingType.PerCharacterSwapCase))', number=10000, globals=globals()))
69.34636392899847
>>> print(timeit.timeit('natsort.natsorted(l3+["0"], alg=natsort.ns.GROUPLETTERS | natsort.ns.LOWERCASEFIRST)', number=10000, globals=globals()))
98.2531585780016

自然排序既相当复杂,又定义模糊。不要忘记事先运行unicodedata.normalize(…),并考虑使用str.casefold()而不是str.lower()。可能有一些微妙的编码问题我还没有考虑到。因此,我尝试性地推荐natsort库。我快速浏览了一下github存储库;代码维护非常出色。

All the algorithms I've seen depend on tricks such as duplicating and lowering characters, and swapping case. While this doubles the running time, an alternative would require a total natural ordering on the input character set. I don't think this is part of the unicode specification, and since there are many more unicode digits than [0-9], creating such a sorting would be equally daunting. If you want locale-aware comparisons, prepare your strings with locale.strxfrm per Python's Sorting HOW TO.

其他回答

考虑到:

data = ['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']

类似于SergO的解决方案,没有外部库的1-liner将是:

data.sort(key=lambda x: int(x[3:]))

or

sorted_data = sorted(data, key=lambda x: int(x[3:]))

解释:

该解决方案使用sort的关键特性来定义将用于排序的函数。因为我们知道每个数据条目前面都有'elm',排序函数将字符串中第三个字符之后的部分(即int(x[3:]))转换为整数。如果数据的数值部分在不同的位置,那么函数的这部分将不得不改变。

我写了一个基于http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html的函数,它增加了传递自己的“键”参数的能力。我需要这样才能执行包含更复杂对象(不仅仅是字符串)的列表的自然排序。

import re

def natural_sort(list, key=lambda s:s):
    """
    Sort the list into natural alphanumeric order.
    """
    def get_alphanum_key_func(key):
        convert = lambda text: int(text) if text.isdigit() else text 
        return lambda s: [convert(c) for c in re.split('([0-9]+)', key(s))]
    sort_key = get_alphanum_key_func(key)
    list.sort(key=sort_key)

例如:

my_list = [{'name':'b'}, {'name':'10'}, {'name':'a'}, {'name':'1'}, {'name':'9'}]
natural_sort(my_list, key=lambda x: x['name'])
print my_list
[{'name': '1'}, {'name': '9'}, {'name': '10'}, {'name': 'a'}, {'name': 'b'}]
a = ['H1', 'H100', 'H10', 'H3', 'H2', 'H6', 'H11', 'H50', 'H5', 'H99', 'H8']
b = ''
c = []

def bubble(bad_list):#bubble sort method
        length = len(bad_list) - 1
        sorted = False

        while not sorted:
                sorted = True
                for i in range(length):
                        if bad_list[i] > bad_list[i+1]:
                                sorted = False
                                bad_list[i], bad_list[i+1] = bad_list[i+1], bad_list[i] #sort the integer list 
                                a[i], a[i+1] = a[i+1], a[i] #sort the main list based on the integer list index value

for a_string in a: #extract the number in the string character by character
        for letter in a_string:
                if letter.isdigit():
                        #print letter
                        b += letter
        c.append(b)
        b = ''

print 'Before sorting....'
print a
c = map(int, c) #converting string list into number list
print c
bubble(c)

print 'After sorting....'
print c
print a

应答:

气泡排序作业

如何在python中一次读一个字母的字符串

这是一个更高级的解决方案,由Claudiu和Mark Byers改进:

它使用casefold()而不是lower()来匹配字符串 您可以传递另一个键lambda来选择一个内部元素(就像您习惯使用普通排序函数一样) 它当然适用于列表。Sort, sorted, max,等等。

def natural_sort(key=None, _nsre=re.compile('([0-9]+)')):
    return lambda x: [int(text) if text.isdigit() else text.casefold()
            for text in _nsre.split(key(x) if key else x)]

使用示例:

# Original solution
data.sort(key=natural_sort)

# Select an additional key
image_files.sort(key=natural_sort(lambda x: x.original_filename))

我使用的算法是padzero_with_lower,定义如下:

import re

def padzero_with_lower(s):
    return re.sub(r'\d+', lambda m: m.group(0).rjust(10, '0'), s).lower()

该算法发现:

查找并填充任意长度的数字,直到足够大的长度,例如10 然后,它将字符串转换为小写

下面是一个用法示例:

print(padzero_with_lower('file1.txt'))   # file0000000001.txt
print(padzero_with_lower('file12.txt'))  # file0000000012.txt
print(padzero_with_lower('file23.txt'))  # file0000000023.txt
print(padzero_with_lower('file123.txt')) # file0000000123.txt
print(padzero_with_lower('file301.txt')) # file0000000301.txt
print(padzero_with_lower('Dir2/file15.txt'))  # dir0000000002/file0000000015.txt
print(padzero_with_lower('dir2/file123.txt')) # dir0000000002/file0000000123.txt
print(padzero_with_lower('dir15/file2.txt'))  # dir0000000015/file0000000002.txt
print(padzero_with_lower('Dir15/file15.txt')) # dir0000000015/file0000000015.txt
print(padzero_with_lower('elm0'))  # elm0000000000
print(padzero_with_lower('elm1'))  # elm0000000001
print(padzero_with_lower('Elm2'))  # elm0000000002
print(padzero_with_lower('elm9'))  # elm0000000009
print(padzero_with_lower('elm10')) # elm0000000010
print(padzero_with_lower('Elm11')) # elm0000000011 
print(padzero_with_lower('Elm12')) # elm0000000012
print(padzero_with_lower('elm13')) # elm0000000013

测试了这个函数后,我们现在可以使用它作为我们的键。

lis = ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
lis.sort(key=padzero_with_lower)
print(lis)
# Output: ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']