Turn a string into a valid filename?

I have a string that I want to use as a filename, so I want to remove every character that is not allowed in filenames, using Python.

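I would rather be strict, so let's say I want to keep only letters, digits, and a small set of other characters such as "_-.()". What is the most elegant solution?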

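The filename needs to be valid on multiple operating systems (Windows, Linux, and Mac OS): it is an MP3 file in my library, with the song title as its filename, and it is shared and backed up between 3 machines.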

Current answer

You can use a generator expression together with the string methods.

>>> s
'foo-bar#baz?qux@127/\\9]'
>>> "".join(x for x in s if x.isalnum())
'foobarbazqux1279'
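
Note that isalnum() also drops "-" and ".", and that it accepts non-ASCII letters. If you want exactly the stricter set described in the question, a whitelist is one way to do it. This is only a minimal sketch: the names VALID_CHARS and keep_valid_chars are made up for illustration, and the allowed set is taken from the question's "_-.()".

```python
import string

# Whitelist from the question: ASCII letters, digits, and "_-.()".
VALID_CHARS = frozenset(string.ascii_letters + string.digits + "_-.()")

def keep_valid_chars(name):
    # Hypothetical helper: drop every character that is not explicitly allowed.
    return "".join(ch for ch in name if ch in VALID_CHARS)

print(keep_valid_chars("foo-bar#baz?qux@127/\\9].mp3"))  # foo-barbazqux1279.mp3
```

Add a space to VALID_CHARS if you want to keep spaces in song titles.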

2008-11-17 09:12:49

Other answers

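You have to be careful, though. Your question does not say whether you are dealing only with Latin-script text. If you sanitize words down to ASCII characters only, some of them may become meaningless or take on a different meaning.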

Suppose you have "forêt poésie" (forest poetry): your sanitization might give "fort-posie" (strong + something meaningless).

It gets worse if you have to deal with Chinese characters.

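For "下北沢", your system might end up producing "---", which is doomed to fail after a while and is not very helpful. So if you are only dealing with files, I would suggest either naming them with a generic string that you control or keeping the characters as they are. For URIs, much the same applies.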
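
To make the warning concrete, here is a small sketch of two common ASCII-only clean-ups. The helper names ascii_only and strip_accents are invented for this example; the second one uses the NFKD-then-encode trick, which keeps the base letter of an accented character but still loses CJK text entirely.

```python
import unicodedata

def ascii_only(text):
    # Hypothetical helper: simply drop every non-ASCII character.
    return "".join(ch for ch in text if ord(ch) < 128)

def strip_accents(text):
    # Hypothetical helper: decompose accented letters, then drop what is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

title = "for\u00eat po\u00e9sie"           # "forêt poésie"
print(ascii_only(title))                   # fort posie
print(strip_accents(title))                # foret poesie
print(repr(strip_accents("\u4e0b\u5317\u6ca2")))  # '' (nothing of "下北沢" survives)
```

An empty result like the last one is exactly the case where you need a fallback name of your own.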

2009-03-11 10:44:46

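You can look at the Django framework (but take its license into account!) to see how it creates a "slug" from arbitrary text. A slug is URL- and filename-friendly.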

The Django text utils define a function, slugify(), that is probably the gold standard for this kind of thing. Essentially, their code is the following.

import unicodedata
import re

def slugify(value, allow_unicode=False):
    """
    Taken from https://github.com/django/django/blob/master/django/utils/text.py
    Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
    dashes to single dashes. Remove characters that aren't alphanumerics,
    underscores, or hyphens. Convert to lowercase. Also strip leading and
    trailing whitespace, dashes, and underscores.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')

The older version (Python 2, hence the calls to unicode()) was:

def slugify(value):
    """
    Normalizes string, converts to lowercase, removes non-alpha characters,
    and converts spaces to hyphens.
    """
    import unicodedata
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
    value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
    value = unicode(re.sub('[-\s]+', '-', value))
    # ...
    return value

There is more, but I left it out, since it deals with escaping rather than slugification.
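
For a file such as an MP3 you probably want to keep the extension, since slugify() strips the dot. The following is a rough sketch of one way to do that, not something Django provides: slugify_filename and the "untitled" fallback are assumptions made for illustration, and the extension itself is only lower-cased, not sanitized. The slugify body is repeated from above so the example runs on its own.

```python
import os
import re
import unicodedata

def slugify(value, allow_unicode=False):
    # Same steps as the Python 3 slugify() quoted in this answer.
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize("NFKC", value)
    else:
        value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^\w\s-]", "", value.lower())
    return re.sub(r"[-\s]+", "-", value).strip("-_")

def slugify_filename(name, fallback="untitled"):
    # Hypothetical helper: slugify the stem and lower-case the extension.
    stem, ext = os.path.splitext(name)
    # An all-non-ASCII title slugifies to "", so fall back to a fixed name.
    return (slugify(stem) or fallback) + ext.lower()

print(slugify_filename("For\u00eat Po\u00e9sie (Live).MP3"))  # foret-poesie-live.mp3
print(slugify_filename("\u4e0b\u5317\u6ca2.mp3"))              # untitled.mp3
```

Passing allow_unicode=True to slugify() would keep a title like the second one instead of falling back.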

2008-11-17 12:23:52

Just as S.Lott answered, you can look at the Django framework to see how it converts a string to a valid filename.

The most recent and updated version is in utils/text.py and defines "get_valid_filename", which is as follows:

import re

def get_valid_filename(s):
    s = str(s).strip().replace(' ', '_')
    return re.sub(r'(?u)[^-\w.]', '', s)

(See https://github.com/django/django/blob/master/django/utils/text.py)
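
Here is a quick sketch of what that function does to a few inputs. The function body is copied from the snippet above so the example is self-contained, and the sample titles are made up. Because of the (?u) flag, \w matches Unicode letters, so non-ASCII titles are kept rather than stripped.

```python
import re

def get_valid_filename(s):
    # Copied from the snippet above: trim, turn spaces into underscores,
    # then drop everything that is not a word character, '-' or '.'.
    s = str(s).strip().replace(' ', '_')
    return re.sub(r'(?u)[^-\w.]', '', s)

print(get_valid_filename("  Song: Title?.mp3 "))      # Song_Title.mp3
print(get_valid_filename("\u4e0b\u5317\u6ca2.mp3"))   # 下北沢.mp3
print(get_valid_filename(".."))                       # ..
```

As the last line shows, a name made only of dots passes through unchanged, so you may want to reject "", "." and ".." yourself before using the result.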

2017-10-18 00:24:44

Here is another answer, specific to Windows paths, that uses simple replacement and no fancy modules:

import re

def check_for_illegal_char(input_str):
    # remove illegal characters for Windows file names/paths 
    # (illegal filenames are a superset (41) of the illegal path names (36))
    # this is according to windows blacklist obtained with Powershell
    # from: https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names/44750843#44750843
    #
    # PS> $enc = [system.Text.Encoding]::UTF8
    # PS> $FileNameInvalidChars = [System.IO.Path]::GetInvalidFileNameChars()
    # PS> $FileNameInvalidChars | foreach { $enc.GetBytes($_) } | Out-File -FilePath InvalidFileCharCodes.txt

    illegal = '\u0022\u003c\u003e\u007c\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\u0008' + \
              '\u0009\u000a\u000b\u000c\u000d\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015' + \
              '\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f\u003a\u002a\u003f\u005c\u002f' 

    # re.escape() stops the trailing backslash and '/' from being read as a regex escape
    output_str, _ = re.subn('[' + re.escape(illegal) + ']', '_', input_str)
    output_str = output_str.replace('..', '_')   # double dots are illegal too, or at least a bad idea
    output_str = output_str[:-1] if output_str.endswith('.') else output_str  # a name can't end with '.' (safe on empty input)

    if output_str != input_str:
        print(f"The name '{input_str}' had invalid characters, "
              f"name was modified to '{output_str}'")

    return output_str

Testing with check_for_illegal_char('fas\u0003\u0004good\\..asd.'), I get:

The name 'fas♥♦good\..asd.' had invalid characters, name was modified to 'fas__good__asd'
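
If you would rather avoid regular expressions altogether, the same replacement can be sketched with str.translate. This is an alternative I am suggesting, not part of the function above: ILLEGAL, TABLE and replace_illegal_chars are names chosen for the example, and the character list is the same one as in the illegal string (control codes 0 to 31 plus " < > | : * ? \ /).

```python
# Same characters as the 'illegal' string above, built without \u escapes.
ILLEGAL = [chr(code) for code in range(32)] + list('"<>|:*?\\/')

# Translation table mapping each illegal character to an underscore.
TABLE = str.maketrans({ch: "_" for ch in ILLEGAL})

def replace_illegal_chars(name):
    # Hypothetical helper: replace illegal characters, collapse '..', drop a trailing '.'.
    cleaned = name.translate(TABLE).replace("..", "_")
    return cleaned.rstrip(".")

print(replace_illegal_chars('fas\u0003\u0004good\\..asd.'))  # fas__good__asd
```

Because no pattern is involved, the backslash needs no special handling, and an empty input simply returns an empty string.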

2021-06-22 12:30:24
