我有一个字符串,我想用它作为文件名,所以我想用Python删除文件名中不允许的所有字符。

我宁愿严格一点,所以假设我想只保留字母、数字和一小组其他字符,如“_-.()”。”。最优雅的解决方案是什么?

文件名需要在多个操作系统(Windows, Linux和Mac OS)上有效——它是我库中的一个MP3文件,以歌曲标题为文件名,并在3台机器之间共享和备份。


当前回答

Github上有个不错的项目叫python-slugify:

安装:

pip install python-slugify

然后使用:

>>> from slugify import slugify
>>> txt = "This\ is/ a%#$ test ---"
>>> slugify(txt)
'this-is-a-test'

其他回答

This whitelist approach (ie, allowing only the chars present in valid_chars) will work if there aren't limits on the formatting of the files or combination of valid chars that are illegal (like ".."), for example, what you say would allow a filename named " . txt" which I think is not valid on Windows. As this is the most simple approach I'd try to remove whitespace from the valid_chars and prepend a known valid string in case of error, any other approach will have to know about what is allowed where to cope with Windows file naming limitations and thus be a lot more complex.

>>> import string
>>> valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
>>> valid_chars
'-_.() abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
>>> filename = "This Is a (valid) - filename%$&$ .txt"
>>> ''.join(c for c in filename if c in valid_chars)
'This Is a (valid) - filename .txt'

这是Windows特定路径的另一个答案,使用简单的替换,没有时髦的模块:

import re

def check_for_illegal_char(input_str):
    # remove illegal characters for Windows file names/paths 
    # (illegal filenames are a superset (41) of the illegal path names (36))
    # this is according to windows blacklist obtained with Powershell
    # from: https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names/44750843#44750843
    #
    # PS> $enc = [system.Text.Encoding]::UTF8
    # PS> $FileNameInvalidChars = [System.IO.Path]::GetInvalidFileNameChars()
    # PS> $FileNameInvalidChars | foreach { $enc.GetBytes($_) } | Out-File -FilePath InvalidFileCharCodes.txt

    illegal = '\u0022\u003c\u003e\u007c\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\u0008' + \
              '\u0009\u000a\u000b\u000c\u000d\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015' + \
              '\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f\u003a\u002a\u003f\u005c\u002f' 

    output_str, _ = re.subn('['+illegal+']','_', input_str)
    output_str = output_str.replace('\\','_')   # backslash cannot be handled by regex
    output_str = output_str.replace('..','_')   # double dots are illegal too, or at least a bad idea 
    output_str = output_str[:-1] if output_str[-1] == '.' else output_str # can't have end of line '.'

    if output_str != input_str:
        print(f"The name '{input_str}' had invalid characters, "
              f"name was modified to '{output_str}'")

    return output_str

当测试check_for_illegal_char('fas\u0003\u0004good\\..asd.'),我得到:

The name 'fas♥♦good\..asd.' had invalid characters, name was modified to 'fas__good__asd'

请记住,在Unix系统上实际上没有文件名限制

它可能不包含\0 它可能不包含/

其他一切都是公平的。

$ touch "
> even multiline
> haha
> ^[[31m red ^[[0m
> evil"
$ ls -la 
-rw-r--r--       0 Nov 17 23:39 ?even multiline?haha??[31m red ?[0m?evil
$ ls -lab
-rw-r--r--       0 Nov 17 23:39 \neven\ multiline\nhaha\n\033[31m\ red\ \033[0m\nevil
$ perl -e 'for my $i ( glob(q{./*even*}) ){ print $i; } '
./
even multiline
haha
 red 
evil

是的,我只是将ANSI颜色代码存储在一个文件名中,并使它们生效。

为了娱乐,在目录名中放入一个BEL字符,并观看当您CD到其中时所产生的乐趣;)

不过你得小心点。如果你只看拉丁语言,在你的介绍中没有清楚地说出来。如果您仅使用ascii字符对某些单词进行消毒,它们可能会变得毫无意义或具有其他含义。

假设你有“forêt poésie”(森林诗歌),你的消毒可能会给“堡垒-posie”(强大+无意义的东西)

如果你必须处理汉字,那就更糟了。

“下北沢”您的系统可能最终会执行“——”,这注定会在一段时间后失败,而且没有多大帮助。因此,如果您只处理文件,我建议您将它们称为您控制的通用链,或者保持字符原样。对于uri,大致相同。

>>> import string
>>> safechars = bytearray(('_-.()' + string.digits + string.ascii_letters).encode())
>>> allchars = bytearray(range(0x100))
>>> deletechars = bytearray(set(allchars) - set(safechars))
>>> filename = u'#ab\xa0c.$%.txt'
>>> safe_filename = filename.encode('ascii', 'ignore').translate(None, deletechars).decode()
>>> safe_filename
'abc..txt'

它不处理空字符串,特殊文件名('nul', 'con'等)。