我有一个字符串列表,我想执行一个自然的字母排序。

例如,下面的列表是自然排序(我想要的):

['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

下面是上面列表的“排序”版本(我使用sorted()得到的):

['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']

我在寻找一个排序函数它的行为和第一个一样。


当前回答

data = ['elm13', 'elm9', 'elm0', 'elm1', 'Elm11', 'Elm2', 'elm10']

让我们分析一下数据。所有元素的数字容量为2。在常见的字面部分“elm”中有3个字母。

所以,元素的最大长度是5。我们可以增加这个值以确保(例如,增加到8)。

记住这一点,我们有一个简单的解决方案:

data.sort(key=lambda x: '{0:0>8}'.format(x).lower())

没有正则表达式和外部库!

print(data)

>>> ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'elm13']

解释:

for elm in data:
    print('{0:0>8}'.format(elm).lower())

>>>
0000elm0
0000elm1
0000elm2
0000elm9
000elm10
000elm11
000elm13

其他回答

基于这里的答案,我写了一个natural_sorted函数,它的行为类似于内置函数的排序:

# Copyright (C) 2018, Benjamin Drung <bdrung@posteo.de>
#
# Permission to use, copy, modify, and/or distribute this software for any
# purpose with or without fee is hereby granted, provided that the above
# copyright notice and this permission notice appear in all copies.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
# WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
# ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
# WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
# ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
# OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

import re

def natural_sorted(iterable, key=None, reverse=False):
    """Return a new naturally sorted list from the items in *iterable*.

    The returned list is in natural sort order. The string is ordered
    lexicographically (using the Unicode code point number to order individual
    characters), except that multi-digit numbers are ordered as a single
    character.

    Has two optional arguments which must be specified as keyword arguments.

    *key* specifies a function of one argument that is used to extract a
    comparison key from each list element: ``key=str.lower``.  The default value
    is ``None`` (compare the elements directly).

    *reverse* is a boolean value.  If set to ``True``, then the list elements are
    sorted as if each comparison were reversed.

    The :func:`natural_sorted` function is guaranteed to be stable. A sort is
    stable if it guarantees not to change the relative order of elements that
    compare equal --- this is helpful for sorting in multiple passes (for
    example, sort by department, then by salary grade).
    """
    prog = re.compile(r"(\d+)")

    def alphanum_key(element):
        """Split given key in list of strings and digits"""
        return [int(c) if c.isdigit() else c for c in prog.split(key(element)
                if key else element)]

    return sorted(iterable, key=alphanum_key, reverse=reverse)

源代码也可以在我的GitHub片段存储库: https://github.com/bdrung/snippets/blob/master/natural_sorted.py

上面的答案对于上面给出的具体例子是有用的,但对于更普遍的自然排序问题,却遗漏了几个有用的例子。我刚刚被其中一个案例咬了一口,所以想出了一个更彻底的解决方案:

def natural_sort_key(string_or_number):
    """
    by Scott S. Lawton <scott@ProductArchitect.com> 2014-12-11; public domain and/or CC0 license

    handles cases where simple 'int' approach fails, e.g.
        ['0.501', '0.55'] floating point with different number of significant digits
        [0.01, 0.1, 1]    already numeric so regex and other string functions won't work (and aren't required)
        ['elm1', 'Elm2']  ASCII vs. letters (not case sensitive)
    """

    def try_float(astring):
        try:
            return float(astring)
        except:
            return astring

    if isinstance(string_or_number, basestring):
        string_or_number = string_or_number.lower()

        if len(re.findall('[.]\d', string_or_number)) <= 1:
            # assume a floating point value, e.g. to correctly sort ['0.501', '0.55']
            # '.' for decimal is locale-specific, e.g. correct for the Anglosphere and Asia but not continental Europe
            return [try_float(s) for s in re.split(r'([\d.]+)', string_or_number)]
        else:
            # assume distinct fields, e.g. IP address, phone number with '.', etc.
            # caveat: might want to first split by whitespace
            # TBD: for unicode, replace isdigit with isdecimal
            return [int(s) if s.isdigit() else s for s in re.split(r'(\d+)', string_or_number)]
    else:
        # consider: add code to recurse for lists/tuples and perhaps other iterables
        return string_or_number

测试代码和几个链接(在StackOverflow上和关闭)在这里: http://productarchitect.com/code/better-natural-sort.py

欢迎您的反馈。这并不是一个明确的解决方案;只是向前迈出了一步。

一个紧凑的解决方案,基于将字符串转换为List[Tuple(str, int)]。

Code

def string_to_pairs(s, pairs=re.compile(r"(\D*)(\d*)").findall):
    return [(text.lower(), int(digits or 0)) for (text, digits) in pairs(s)[:-1]]

示范

sorted(['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9'], key=string_to_pairs)

输出:

['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

测试

转换

assert string_to_pairs("") == []
assert string_to_pairs("123") == [("", 123)]
assert string_to_pairs("abc") == [("abc", 0)]
assert string_to_pairs("123abc") == [("", 123), ("abc", 0)]
assert string_to_pairs("abc123") == [("abc", 123)]
assert string_to_pairs("123abc456") == [("", 123), ("abc", 456)]
assert string_to_pairs("abc123efg") == [("abc", 123), ("efg", 0)]

排序

# Some extracts from the test suite of the natsort library. Permalink:
# https://github.com/SethMMorton/natsort/blob/e3c32f5638bf3a0e9a23633495269bea0e75d379/tests/test_natsorted.py

sort_data = [
    (  # same as test_natsorted_can_sort_as_unsigned_ints_which_is_default()
        ["a50", "a51.", "a50.31", "a-50", "a50.4", "a5.034e1", "a50.300"],
        ["a5.034e1", "a50", "a50.4", "a50.31", "a50.300", "a51.", "a-50"],
    ),
    (  # same as test_natsorted_numbers_in_ascending_order()
        ["a2", "a5", "a9", "a1", "a4", "a10", "a6"],
        ["a1", "a2", "a4", "a5", "a6", "a9", "a10"],
    ),
    (  # same as test_natsorted_can_sort_as_version_numbers()
        ["1.9.9a", "1.11", "1.9.9b", "1.11.4", "1.10.1"],
        ["1.9.9a", "1.9.9b", "1.10.1", "1.11", "1.11.4"],
    ),
    (  # different from test_natsorted_handles_filesystem_paths()
        [
            "/p/Folder (10)/file.tar.gz",
            "/p/Folder (1)/file (1).tar.gz",
            "/p/Folder/file.x1.9.tar.gz",
            "/p/Folder (1)/file.tar.gz",
            "/p/Folder/file.x1.10.tar.gz",
        ],
        [
            "/p/Folder (1)/file (1).tar.gz",
            "/p/Folder (1)/file.tar.gz",
            "/p/Folder (10)/file.tar.gz",
            "/p/Folder/file.x1.9.tar.gz",
            "/p/Folder/file.x1.10.tar.gz",
        ],
    ),
    (  # same as test_natsorted_path_extensions_heuristic()
        [
            "Try.Me.Bug - 09 - One.Two.Three.[text].mkv",
            "Try.Me.Bug - 07 - One.Two.5.[text].mkv",
            "Try.Me.Bug - 08 - One.Two.Three[text].mkv",
        ],
        [
            "Try.Me.Bug - 07 - One.Two.5.[text].mkv",
            "Try.Me.Bug - 08 - One.Two.Three[text].mkv",
            "Try.Me.Bug - 09 - One.Two.Three.[text].mkv",
        ],
    ),
    (  # same as ns.IGNORECASE for test_natsorted_supports_case_handling()
        ["Apple", "corn", "Corn", "Banana", "apple", "banana"],
        ["Apple", "apple", "Banana", "banana", "corn", "Corn"],
    ),

]

for (given, expected) in sort_data:
    assert sorted(given, key=string_to_pairs) == expected

奖金

如果字符串混合了非ascii文本和数字,您可能会对将string_to_pairs()与我在其他地方给出的函数remove_diacritics()组合感兴趣。

一种选择是将字符串转换为元组,并使用展开形式http://wiki.answers.com/Q/What_does_expanded_form_mean替换数字

这样a90就会变成("a",90,0)而a1就会变成("a",1)

下面是一些示例代码(这不是很有效,因为它从数字中删除前导0的方式)

alist=["something1",
    "something12",
    "something17",
    "something2",
    "something25and_then_33",
    "something25and_then_34",
    "something29",
    "beta1.1",
    "beta2.3.0",
    "beta2.33.1",
    "a001",
    "a2",
    "z002",
    "z1"]

def key(k):
    nums=set(list("0123456789"))
        chars=set(list(k))
    chars=chars-nums
    for i in range(len(k)):
        for c in chars:
            k=k.replace(c+"0",c)
    l=list(k)
    base=10
    j=0
    for i in range(len(l)-1,-1,-1):
        try:
            l[i]=int(l[i])*base**j
            j+=1
        except:
            j=0
    l=tuple(l)
    print l
    return l

print sorted(alist,key=key)

输出:

('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 1)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 10, 2)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 10, 7)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 2)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 5, 'a', 'n', 'd', '_', 't', 'h', 'e', 'n', '_', 30, 3)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 5, 'a', 'n', 'd', '_', 't', 'h', 'e', 'n', '_', 30, 4)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 9)
('b', 'e', 't', 'a', 1, '.', 1)
('b', 'e', 't', 'a', 2, '.', 3, '.')
('b', 'e', 't', 'a', 2, '.', 30, 3, '.', 1)
('a', 1)
('a', 2)
('z', 2)
('z', 1)
['a001', 'a2', 'beta1.1', 'beta2.3.0', 'beta2.33.1', 'something1', 'something2', 'something12', 'something17', 'something25and_then_33', 'something25and_then_34', 'something29', 'z1', 'z002']

在@Mark Byers的回答之后,这里有一个接受关键参数的适应,并且更符合pep8。

def natsorted(seq, key=None):
    def convert(text):
        return int(text) if text.isdigit() else text

    def alphanum(obj):
        if key is not None:
            return [convert(c) for c in re.split(r'([0-9]+)', key(obj))]
        return [convert(c) for c in re.split(r'([0-9]+)', obj)]

    return sorted(seq, key=alphanum)

我还做了一个Gist