我有一个字符串列表,我想执行一个自然的字母排序。
例如,下面的列表是自然排序(我想要的):
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
下面是上面列表的“排序”版本(我使用sorted()得到的):
['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']
我在寻找一个排序函数它的行为和第一个一样。
我有一个字符串列表,我想执行一个自然的字母排序。
例如,下面的列表是自然排序(我想要的):
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
下面是上面列表的“排序”版本(我使用sorted()得到的):
['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']
我在寻找一个排序函数它的行为和第一个一样。
当前回答
data = ['elm13', 'elm9', 'elm0', 'elm1', 'Elm11', 'Elm2', 'elm10']
让我们分析一下数据。所有元素的数字容量为2。在常见的字面部分“elm”中有3个字母。
所以,元素的最大长度是5。我们可以增加这个值以确保(例如,增加到8)。
记住这一点,我们有一个简单的解决方案:
data.sort(key=lambda x: '{0:0>8}'.format(x).lower())
没有正则表达式和外部库!
print(data)
>>> ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'elm13']
解释:
for elm in data:
print('{0:0>8}'.format(elm).lower())
>>>
0000elm0
0000elm1
0000elm2
0000elm9
000elm10
000elm11
000elm13
其他回答
基于这里的答案,我写了一个natural_sorted函数,它的行为类似于内置函数的排序:
# Copyright (C) 2018, Benjamin Drung <bdrung@posteo.de>
#
# Permission to use, copy, modify, and/or distribute this software for any
# purpose with or without fee is hereby granted, provided that the above
# copyright notice and this permission notice appear in all copies.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
# WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
# ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
# WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
# ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
# OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
import re
def natural_sorted(iterable, key=None, reverse=False):
"""Return a new naturally sorted list from the items in *iterable*.
The returned list is in natural sort order. The string is ordered
lexicographically (using the Unicode code point number to order individual
characters), except that multi-digit numbers are ordered as a single
character.
Has two optional arguments which must be specified as keyword arguments.
*key* specifies a function of one argument that is used to extract a
comparison key from each list element: ``key=str.lower``. The default value
is ``None`` (compare the elements directly).
*reverse* is a boolean value. If set to ``True``, then the list elements are
sorted as if each comparison were reversed.
The :func:`natural_sorted` function is guaranteed to be stable. A sort is
stable if it guarantees not to change the relative order of elements that
compare equal --- this is helpful for sorting in multiple passes (for
example, sort by department, then by salary grade).
"""
prog = re.compile(r"(\d+)")
def alphanum_key(element):
"""Split given key in list of strings and digits"""
return [int(c) if c.isdigit() else c for c in prog.split(key(element)
if key else element)]
return sorted(iterable, key=alphanum_key, reverse=reverse)
源代码也可以在我的GitHub片段存储库: https://github.com/bdrung/snippets/blob/master/natural_sorted.py
上面的答案对于上面给出的具体例子是有用的,但对于更普遍的自然排序问题,却遗漏了几个有用的例子。我刚刚被其中一个案例咬了一口,所以想出了一个更彻底的解决方案:
def natural_sort_key(string_or_number):
"""
by Scott S. Lawton <scott@ProductArchitect.com> 2014-12-11; public domain and/or CC0 license
handles cases where simple 'int' approach fails, e.g.
['0.501', '0.55'] floating point with different number of significant digits
[0.01, 0.1, 1] already numeric so regex and other string functions won't work (and aren't required)
['elm1', 'Elm2'] ASCII vs. letters (not case sensitive)
"""
def try_float(astring):
try:
return float(astring)
except:
return astring
if isinstance(string_or_number, basestring):
string_or_number = string_or_number.lower()
if len(re.findall('[.]\d', string_or_number)) <= 1:
# assume a floating point value, e.g. to correctly sort ['0.501', '0.55']
# '.' for decimal is locale-specific, e.g. correct for the Anglosphere and Asia but not continental Europe
return [try_float(s) for s in re.split(r'([\d.]+)', string_or_number)]
else:
# assume distinct fields, e.g. IP address, phone number with '.', etc.
# caveat: might want to first split by whitespace
# TBD: for unicode, replace isdigit with isdecimal
return [int(s) if s.isdigit() else s for s in re.split(r'(\d+)', string_or_number)]
else:
# consider: add code to recurse for lists/tuples and perhaps other iterables
return string_or_number
测试代码和几个链接(在StackOverflow上和关闭)在这里: http://productarchitect.com/code/better-natural-sort.py
欢迎您的反馈。这并不是一个明确的解决方案;只是向前迈出了一步。
一个紧凑的解决方案,基于将字符串转换为List[Tuple(str, int)]。
Code
def string_to_pairs(s, pairs=re.compile(r"(\D*)(\d*)").findall):
return [(text.lower(), int(digits or 0)) for (text, digits) in pairs(s)[:-1]]
示范
sorted(['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9'], key=string_to_pairs)
输出:
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
测试
转换
assert string_to_pairs("") == []
assert string_to_pairs("123") == [("", 123)]
assert string_to_pairs("abc") == [("abc", 0)]
assert string_to_pairs("123abc") == [("", 123), ("abc", 0)]
assert string_to_pairs("abc123") == [("abc", 123)]
assert string_to_pairs("123abc456") == [("", 123), ("abc", 456)]
assert string_to_pairs("abc123efg") == [("abc", 123), ("efg", 0)]
排序
# Some extracts from the test suite of the natsort library. Permalink:
# https://github.com/SethMMorton/natsort/blob/e3c32f5638bf3a0e9a23633495269bea0e75d379/tests/test_natsorted.py
sort_data = [
( # same as test_natsorted_can_sort_as_unsigned_ints_which_is_default()
["a50", "a51.", "a50.31", "a-50", "a50.4", "a5.034e1", "a50.300"],
["a5.034e1", "a50", "a50.4", "a50.31", "a50.300", "a51.", "a-50"],
),
( # same as test_natsorted_numbers_in_ascending_order()
["a2", "a5", "a9", "a1", "a4", "a10", "a6"],
["a1", "a2", "a4", "a5", "a6", "a9", "a10"],
),
( # same as test_natsorted_can_sort_as_version_numbers()
["1.9.9a", "1.11", "1.9.9b", "1.11.4", "1.10.1"],
["1.9.9a", "1.9.9b", "1.10.1", "1.11", "1.11.4"],
),
( # different from test_natsorted_handles_filesystem_paths()
[
"/p/Folder (10)/file.tar.gz",
"/p/Folder (1)/file (1).tar.gz",
"/p/Folder/file.x1.9.tar.gz",
"/p/Folder (1)/file.tar.gz",
"/p/Folder/file.x1.10.tar.gz",
],
[
"/p/Folder (1)/file (1).tar.gz",
"/p/Folder (1)/file.tar.gz",
"/p/Folder (10)/file.tar.gz",
"/p/Folder/file.x1.9.tar.gz",
"/p/Folder/file.x1.10.tar.gz",
],
),
( # same as test_natsorted_path_extensions_heuristic()
[
"Try.Me.Bug - 09 - One.Two.Three.[text].mkv",
"Try.Me.Bug - 07 - One.Two.5.[text].mkv",
"Try.Me.Bug - 08 - One.Two.Three[text].mkv",
],
[
"Try.Me.Bug - 07 - One.Two.5.[text].mkv",
"Try.Me.Bug - 08 - One.Two.Three[text].mkv",
"Try.Me.Bug - 09 - One.Two.Three.[text].mkv",
],
),
( # same as ns.IGNORECASE for test_natsorted_supports_case_handling()
["Apple", "corn", "Corn", "Banana", "apple", "banana"],
["Apple", "apple", "Banana", "banana", "corn", "Corn"],
),
]
for (given, expected) in sort_data:
assert sorted(given, key=string_to_pairs) == expected
奖金
如果字符串混合了非ascii文本和数字,您可能会对将string_to_pairs()与我在其他地方给出的函数remove_diacritics()组合感兴趣。
一种选择是将字符串转换为元组,并使用展开形式http://wiki.answers.com/Q/What_does_expanded_form_mean替换数字
这样a90就会变成("a",90,0)而a1就会变成("a",1)
下面是一些示例代码(这不是很有效,因为它从数字中删除前导0的方式)
alist=["something1",
"something12",
"something17",
"something2",
"something25and_then_33",
"something25and_then_34",
"something29",
"beta1.1",
"beta2.3.0",
"beta2.33.1",
"a001",
"a2",
"z002",
"z1"]
def key(k):
nums=set(list("0123456789"))
chars=set(list(k))
chars=chars-nums
for i in range(len(k)):
for c in chars:
k=k.replace(c+"0",c)
l=list(k)
base=10
j=0
for i in range(len(l)-1,-1,-1):
try:
l[i]=int(l[i])*base**j
j+=1
except:
j=0
l=tuple(l)
print l
return l
print sorted(alist,key=key)
输出:
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 1)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 10, 2)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 10, 7)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 2)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 5, 'a', 'n', 'd', '_', 't', 'h', 'e', 'n', '_', 30, 3)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 5, 'a', 'n', 'd', '_', 't', 'h', 'e', 'n', '_', 30, 4)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 9)
('b', 'e', 't', 'a', 1, '.', 1)
('b', 'e', 't', 'a', 2, '.', 3, '.')
('b', 'e', 't', 'a', 2, '.', 30, 3, '.', 1)
('a', 1)
('a', 2)
('z', 2)
('z', 1)
['a001', 'a2', 'beta1.1', 'beta2.3.0', 'beta2.33.1', 'something1', 'something2', 'something12', 'something17', 'something25and_then_33', 'something25and_then_34', 'something29', 'z1', 'z002']
在@Mark Byers的回答之后,这里有一个接受关键参数的适应,并且更符合pep8。
def natsorted(seq, key=None):
def convert(text):
return int(text) if text.isdigit() else text
def alphanum(obj):
if key is not None:
return [convert(c) for c in re.split(r'([0-9]+)', key(obj))]
return [convert(c) for c in re.split(r'([0-9]+)', obj)]
return sorted(seq, key=alphanum)
我还做了一个Gist