我有一个字符串,我想用它作为文件名,所以我想用Python删除文件名中不允许的所有字符。
我宁愿严格一点,所以假设我想只保留字母、数字和一小组其他字符,如“_-.()”。”。最优雅的解决方案是什么?
文件名需要在多个操作系统(Windows, Linux和Mac OS)上有效——它是我库中的一个MP3文件,以歌曲标题为文件名,并在3台机器之间共享和备份。
我有一个字符串,我想用它作为文件名,所以我想用Python删除文件名中不允许的所有字符。
我宁愿严格一点,所以假设我想只保留字母、数字和一小组其他字符,如“_-.()”。”。最优雅的解决方案是什么?
文件名需要在多个操作系统(Windows, Linux和Mac OS)上有效——它是我库中的一个MP3文件,以歌曲标题为文件名,并在3台机器之间共享和备份。
当前回答
更复杂的是,仅仅删除无效字符并不能保证获得有效的文件名。由于允许的字符在不同的文件名上不同,保守的方法可能最终会将一个有效的名称变成一个无效的名称。你可能想在以下情况下添加特殊处理:
字符串都是无效字符(留下一个空字符串) 你最终会得到一个具有特殊含义的字符串,例如"."或".." 在windows中,某些设备名称是保留的。例如,你不能创建一个名为“nul”,“null .txt”(或nul.txt)的文件。)保留的名字是: CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9
您可能可以通过在文件名前添加一些字符串来解决这些问题,这些字符串永远不会导致这些情况之一,并剥离无效字符。
其他回答
这是Windows特定路径的另一个答案,使用简单的替换,没有时髦的模块:
import re
def check_for_illegal_char(input_str):
# remove illegal characters for Windows file names/paths
# (illegal filenames are a superset (41) of the illegal path names (36))
# this is according to windows blacklist obtained with Powershell
# from: https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names/44750843#44750843
#
# PS> $enc = [system.Text.Encoding]::UTF8
# PS> $FileNameInvalidChars = [System.IO.Path]::GetInvalidFileNameChars()
# PS> $FileNameInvalidChars | foreach { $enc.GetBytes($_) } | Out-File -FilePath InvalidFileCharCodes.txt
illegal = '\u0022\u003c\u003e\u007c\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\u0008' + \
'\u0009\u000a\u000b\u000c\u000d\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015' + \
'\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f\u003a\u002a\u003f\u005c\u002f'
output_str, _ = re.subn('['+illegal+']','_', input_str)
output_str = output_str.replace('\\','_') # backslash cannot be handled by regex
output_str = output_str.replace('..','_') # double dots are illegal too, or at least a bad idea
output_str = output_str[:-1] if output_str[-1] == '.' else output_str # can't have end of line '.'
if output_str != input_str:
print(f"The name '{input_str}' had invalid characters, "
f"name was modified to '{output_str}'")
return output_str
当测试check_for_illegal_char('fas\u0003\u0004good\\..asd.'),我得到:
The name 'fas♥♦good\..asd.' had invalid characters, name was modified to 'fas__good__asd'
不完全是OP要求的,但这是我使用的,因为我需要唯一的和可逆的转换:
# p3 code
def safePath (url):
return ''.join(map(lambda ch: chr(ch) if ch in safePath.chars else '%%%02x' % ch, url.encode('utf-8')))
safePath.chars = set(map(lambda x: ord(x), '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+-_ .'))
结果“有些”可读,至少从系统管理员的角度来看是这样。
请记住,在Unix系统上实际上没有文件名限制
它可能不包含\0 它可能不包含/
其他一切都是公平的。
$ touch " > even multiline > haha > ^[[31m red ^[[0m > evil" $ ls -la -rw-r--r-- 0 Nov 17 23:39 ?even multiline?haha??[31m red ?[0m?evil $ ls -lab -rw-r--r-- 0 Nov 17 23:39 \neven\ multiline\nhaha\n\033[31m\ red\ \033[0m\nevil $ perl -e 'for my $i ( glob(q{./*even*}) ){ print $i; } ' ./ even multiline haha red evil
是的,我只是将ANSI颜色代码存储在一个文件名中,并使它们生效。
为了娱乐,在目录名中放入一个BEL字符,并观看当您CD到其中时所产生的乐趣;)
给,这应该涵盖了所有的基础。它为您处理所有类型的问题,包括(但不限于)字符替换。
适用于Windows、*nix和几乎所有其他文件系统。只允许打印字符。
def txt2filename(txt, chr_set='normal'):
"""Converts txt to a valid Windows/*nix filename with printable characters only.
args:
txt: The str to convert.
chr_set: 'normal', 'universal', or 'inclusive'.
'universal': ' -.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
'normal': Every printable character exept those disallowed on Windows/*nix.
'extended': All 'normal' characters plus the extended character ASCII codes 128-255
"""
FILLER = '-'
# Step 1: Remove excluded characters.
if chr_set == 'universal':
# Lookups in a set are O(n) vs O(n * x) for a str.
printables = set(' -.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz')
else:
if chr_set == 'normal':
max_chr = 127
elif chr_set == 'extended':
max_chr = 256
else:
raise ValueError(f'The chr_set argument may be normal, extended or universal; not {chr_set=}')
EXCLUDED_CHRS = set(r'<>:"/\|?*') # Illegal characters in Windows filenames.
EXCLUDED_CHRS.update(chr(127)) # DEL (non-printable).
printables = set(chr(x)
for x in range(32, max_chr)
if chr(x) not in EXCLUDED_CHRS)
result = ''.join(x if x in printables else FILLER # Allow printable characters only.
for x in txt)
# Step 2: Device names, '.', and '..' are invalid filenames in Windows.
DEVICE_NAMES = 'CON,PRN,AUX,NUL,COM1,COM2,COM3,COM4,' \
'COM5,COM6,COM7,COM8,COM9,LPT1,LPT2,' \
'LPT3,LPT4,LPT5,LPT6,LPT7,LPT8,LPT9,' \
'CONIN$,CONOUT$,..,.'.split() # This list is an O(n) operation.
if result in DEVICE_NAMES:
result = f'-{result}-'
# Step 3: Maximum length of filename is 255 bytes in Windows and Linux (other *nix flavors may allow longer names).
result = result[:255]
# Step 4: Windows does not allow filenames to end with '.' or ' ' or begin with ' '.
result = re.sub(r'^[. ]', FILLER, result)
result = re.sub(r' $', FILLER, result)
return result
这个解决方案不需要外部库。它也替代了不可打印的文件名,因为它们并不总是容易处理。
This whitelist approach (ie, allowing only the chars present in valid_chars) will work if there aren't limits on the formatting of the files or combination of valid chars that are illegal (like ".."), for example, what you say would allow a filename named " . txt" which I think is not valid on Windows. As this is the most simple approach I'd try to remove whitespace from the valid_chars and prepend a known valid string in case of error, any other approach will have to know about what is allowed where to cope with Windows file naming limitations and thus be a lot more complex.
>>> import string
>>> valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
>>> valid_chars
'-_.() abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
>>> filename = "This Is a (valid) - filename%$&$ .txt"
>>> ''.join(c for c in filename if c in valid_chars)
'This Is a (valid) - filename .txt'