我有一个字符串,我想用它作为文件名,所以我想用Python删除文件名中不允许的所有字符。

我宁愿严格一点,所以假设我想只保留字母、数字和一小组其他字符,如“_-.()”。”。最优雅的解决方案是什么?

文件名需要在多个操作系统(Windows, Linux和Mac OS)上有效——它是我库中的一个MP3文件,以歌曲标题为文件名,并在3台机器之间共享和备份。


当前回答

This whitelist approach (ie, allowing only the chars present in valid_chars) will work if there aren't limits on the formatting of the files or combination of valid chars that are illegal (like ".."), for example, what you say would allow a filename named " . txt" which I think is not valid on Windows. As this is the most simple approach I'd try to remove whitespace from the valid_chars and prepend a known valid string in case of error, any other approach will have to know about what is allowed where to cope with Windows file naming limitations and thus be a lot more complex.

>>> import string
>>> valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
>>> valid_chars
'-_.() abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
>>> filename = "This Is a (valid) - filename%$&$ .txt"
>>> ''.join(c for c in filename if c in valid_chars)
'This Is a (valid) - filename .txt'

其他回答

如果你不介意安装一个包,这应该是有用的: https://pypi.org/project/pathvalidate/

从https://pypi.org/project/pathvalidate/ # sanitize-a-filename:

来自您的插件信息 fname =菲:l * e / p \ " a ? t < t > h |。xt” 打印(f“fname) -> (sanitize_filename, fname) fname =“\0_a*b:c<d>e%f/(g)h+i_0.txt” 打印(f“fname) -> (sanitize_filename, fname) 输出 菲:洛杉矶* e - p”? t > h |。<xt ->档案 _a*b:c<d>e%f/(g)h+i_0.txt -> _abcde%f(g)h+i_0.txt

给,这应该涵盖了所有的基础。它为您处理所有类型的问题,包括(但不限于)字符替换。

适用于Windows、*nix和几乎所有其他文件系统。只允许打印字符。

def txt2filename(txt, chr_set='normal'):
    """Converts txt to a valid Windows/*nix filename with printable characters only.

    args:
        txt: The str to convert.
        chr_set: 'normal', 'universal', or 'inclusive'.
            'universal':    ' -.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
            'normal':       Every printable character exept those disallowed on Windows/*nix.
            'extended':     All 'normal' characters plus the extended character ASCII codes 128-255
    """

    FILLER = '-'

    # Step 1: Remove excluded characters.
    if chr_set == 'universal':
        # Lookups in a set are O(n) vs O(n * x) for a str.
        printables = set(' -.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz')
    else:
        if chr_set == 'normal':
            max_chr = 127
        elif chr_set == 'extended':
            max_chr = 256
        else:
            raise ValueError(f'The chr_set argument may be normal, extended or universal; not {chr_set=}')
        EXCLUDED_CHRS = set(r'<>:"/\|?*')               # Illegal characters in Windows filenames.
        EXCLUDED_CHRS.update(chr(127))                  # DEL (non-printable).
        printables = set(chr(x)
                         for x in range(32, max_chr)
                         if chr(x) not in EXCLUDED_CHRS)
    result = ''.join(x if x in printables else FILLER   # Allow printable characters only.
                     for x in txt)

    # Step 2: Device names, '.', and '..' are invalid filenames in Windows.
    DEVICE_NAMES = 'CON,PRN,AUX,NUL,COM1,COM2,COM3,COM4,' \
                   'COM5,COM6,COM7,COM8,COM9,LPT1,LPT2,' \
                   'LPT3,LPT4,LPT5,LPT6,LPT7,LPT8,LPT9,' \
                   'CONIN$,CONOUT$,..,.'.split()        # This list is an O(n) operation.
    if result in DEVICE_NAMES:
        result = f'-{result}-'

    # Step 3: Maximum length of filename is 255 bytes in Windows and Linux (other *nix flavors may allow longer names).
    result = result[:255]

    # Step 4: Windows does not allow filenames to end with '.' or ' ' or begin with ' '.
    result = re.sub(r'^[. ]', FILLER, result)
    result = re.sub(r' $', FILLER, result)

    return result

这个解决方案不需要外部库。它也替代了不可打印的文件名,因为它们并不总是容易处理。

我知道有很多答案,但它们大多依赖于正则表达式或外部模块,所以我想抛出我自己的答案。一个纯python函数,不需要外部模块,不使用正则表达式。我的方法不是清除无效字符,而是只允许有效字符。

def normalizefilename(fn):
    validchars = "-_.() "
    out = ""
    for c in fn:
      if str.isalpha(c) or str.isdigit(c) or (c in validchars):
        out += c
      else:
        out += "_"
    return out    

如果您愿意,您可以在开头向validchars变量添加您自己的有效字符,例如您的国家字母在英语字母表中不存在。这是您可能想要也可能不想要的:一些不运行UTF-8的文件系统在使用非ascii字符时可能仍然存在问题。

此函数用于测试单个文件名的有效性,因此它将路径分隔符替换为_,认为它们是无效字符。如果你想添加它,修改If以包含os路径分隔符是很简单的。

一句话:

valid_file_name = re.sub('[^\w_.)( -]', '', any_string)

你也可以用“_”字符让它更具可读性(例如替换斜杠)

不过你得小心点。如果你只看拉丁语言,在你的介绍中没有清楚地说出来。如果您仅使用ascii字符对某些单词进行消毒,它们可能会变得毫无意义或具有其他含义。

假设你有“forêt poésie”(森林诗歌),你的消毒可能会给“堡垒-posie”(强大+无意义的东西)

如果你必须处理汉字,那就更糟了。

“下北沢”您的系统可能最终会执行“——”,这注定会在一段时间后失败,而且没有多大帮助。因此,如果您只处理文件,我建议您将它们称为您控制的通用链,或者保持字符原样。对于uri,大致相同。