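I know that / is illegal in Linux, and that the following are illegal in Windows (I think): * . " / \ [ ] : ; | ,
What else am I missing?
I need a comprehensive guide, however, and one that takes double-byte characters into account. Links to outside resources are fine with me.
I first need to create a directory on the filesystem using a name that may contain forbidden characters, so I plan to replace those characters with underscores. I then need to write this directory and its contents to a zip file (using Java), so any additional advice concerning the names of zip directories would be appreciated.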
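A "comprehensive guide" to forbidden filename characters is not going to work on Windows, because Windows reserves filenames as well as characters. Yes, characters like * " ? and others are forbidden, but there is also an infinite number of forbidden names composed only of valid characters. For example, spaces and dots are valid filename characters, but names composed only of those characters are forbidden.
Windows does not distinguish between upper-case and lower-case characters, so you cannot create a folder named A if one named a already exists. Worse, seemingly allowed names like PRN and CON, and many others, are reserved and not allowed. Windows also has several length restrictions; a filename that is valid in one folder may become invalid if moved to another folder. The rules for naming files and folders are in the Microsoft docs.
You cannot, in general, use user-generated text to create Windows directory names. If you want to allow users to name things however they like, you have to create safe names like A, AB, A2 and so on, store the user-generated names and their path equivalents in an application data file, and perform the path mapping in your application.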
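As a rough illustration of the name-mapping idea above, here is a minimal Python sketch. The class name SafeNameMap, the dir_0001 naming scheme and the in-memory dictionaries are all assumptions made for this example; a real application would persist the mapping in its own data file.

```python
import itertools
from typing import Optional


class SafeNameMap:
    """Map arbitrary user-supplied names to generated, filesystem-safe names.

    Hypothetical helper: the 'dir_0001' scheme is an assumption, not a standard.
    """

    def __init__(self) -> None:
        self._safe_by_user = {}   # user-supplied name -> generated safe name
        self._user_by_safe = {}   # generated safe name -> user-supplied name
        self._counter = itertools.count(1)

    def safe_name_for(self, user_name: str) -> str:
        """Return the safe name for user_name, allocating a new one if needed."""
        if user_name not in self._safe_by_user:
            safe = f"dir_{next(self._counter):04d}"
            self._safe_by_user[user_name] = safe
            self._user_by_safe[safe] = user_name
        return self._safe_by_user[user_name]

    def user_name_for(self, safe_name: str) -> Optional[str]:
        """Return the original user-supplied name, or None if unknown."""
        return self._user_by_safe.get(safe_name)
```

Only the generated names ever touch the filesystem, so reserved names and forbidden characters in the user's text never reach it.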
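If you absolutely must allow user-generated folder names, the only way to tell whether they are invalid is to catch exceptions and assume the name is at fault. Even that is fraught with peril, because the exceptions thrown for denied access, offline drives and full drives overlap with those thrown for invalid names. You are opening up one huge can of hurt.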
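Well, if only for research purposes, then your best bet is to look at the Wikipedia entry on filenames.
If you want to write a portable function to validate user input and create filenames based on it, the short answer is: don't. Take a look at a portable module like Perl's File::Spec to get a glimpse of all the hoops you have to jump through to accomplish such a "simple" task.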
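Under Linux and other Unix-related systems, there were traditionally only two characters that could not appear in the name of a file or directory: NUL '\0' and slash '/'. The slash, of course, can appear in a pathname, where it separates directory components.
Rumour¹ has it that Steven Bourne (of "shell" fame) had a directory containing 254 files, one for every single letter (character code) that can appear in a file name (excluding / and '\0'; the name . was the current directory, of course). It was used to test the Bourne shell and routinely wrought havoc on unwary programs such as backup programs.
Other people have covered the rules for Windows filenames, with links to Microsoft and Wikipedia on the topic.
Note that MacOS X has a case-insensitive file system by default. Current versions appear to allow a colon : in file names, though historically that was not necessarily always the case: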
$ echo a:b > a:b
$ ls -l a:b
-rw-r--r-- 1 jonathanleffler staff 4 Nov 12 07:38 a:b
$
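However, at least with macOS Big Sur 11.7, the file system does not allow file names that are not valid UTF-8 strings. That means a file name cannot contain any of the bytes that are always invalid in UTF-8 (0xC0, 0xC1, 0xF5-0xFF), and you cannot use one of the continuation bytes 0x80..0xBF as the only byte in a file name. The error given is 92 Illegal byte sequence.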
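The UTF-8 constraint described above can be approximated before a name ever reaches the file system. The following Python sketch is only an illustration: the function name is_valid_utf8_name is made up for this example, and it relies on Python's strict UTF-8 decoder, which may not match macOS behaviour in every detail.

```python
def is_valid_utf8_name(raw_name: bytes) -> bool:
    """Return True if raw_name could be a single Unix file name in valid UTF-8.

    Hypothetical helper: rejects empty names, NUL and '/', then checks that
    the bytes decode as strict UTF-8.
    """
    if not raw_name or b"\0" in raw_name or b"/" in raw_name:
        return False
    try:
        raw_name.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True
```

A lone 0xC0 byte or a lone continuation byte such as 0x80 fails the decode step, which mirrors the two cases mentioned above.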
POSIX defines a Portable Filename Character Set consisting of:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 . _ -
Sticking with names formed solely from those characters avoids most of the problems, though Windows still adds some complications.
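A check against the portable character set listed above is short to write. This Python sketch is only an illustration; the function name is_portable_filename is made up, and it checks the character set alone, not length limits or Windows reserved names.

```python
import re

# The POSIX Portable Filename Character Set: A-Z, a-z, 0-9, period,
# underscore and hyphen.
_PORTABLE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_portable_filename(name: str) -> bool:
    """Return True if name is non-empty and uses only portable characters."""
    return _PORTABLE_NAME.fullmatch(name) is not None
```

For example, is_portable_filename("report-2021_v1.txt") is true, while names containing a space, a colon or a non-ASCII letter are rejected.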
¹ It was Kernighan & Pike in ['The Practice of Programming'](http://www.cs.princeton.edu/~bwk/tpop.webpage/) who said as much in Chapter 6, Testing, §6.5 Stress Tests:

When Steve Bourne was writing his Unix shell (which came to be known as the Bourne shell), he made a directory of 254 files with one-character names, one for each byte value except '\0' and slash, the two characters that cannot appear in Unix file names. He used that directory for all manner of tests of pattern-matching and tokenization. (The test directory was of course created by a program.) For years afterwards, that directory was the bane of file-tree-walking programs; it tested them to destruction.
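Note that the directory must have contained the entries . and .., so it was arguably 253 files (and 2 directories), or 255 name entries, rather than 254 files. This doesn't affect the effectiveness of the anecdote, or the careful testing it describes.
TPOP was previously at http://plan9.bell-labs.com/cm/cs/tpop and http://cm.bell-labs.com/cm/cs/tpop but both are now (2021-11-12) broken. See also Wikipedia on TPOP.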
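Instead of creating a blacklist of characters, you could use a whitelist. All things considered, the range of characters that make sense in a file or directory name is quite short, and unless you have very specific naming requirements, your users will not hold it against your application if they cannot use the whole ASCII table.
It does not solve the problem of reserved names in the target file system, but a whitelist makes it easier to mitigate the risks at the source.
In that spirit, this is a range of characters that can be considered safe:
- Letters (a-z A-Z) - Unicode characters as well, if needed
- Digits (0-9)
- Underscore (_)
- Hyphen (-)
- Space
- Dot (.)
And any additional safe characters you wish to allow. Beyond this, you have to enforce some additional rules regarding spaces and dots. This is usually sufficient:
- The name must contain at least one letter or digit (to avoid names made only of dots/spaces)
- The name must start with a letter or digit (to avoid leading dots/spaces)
- The name may not end with a dot or space (simply trim those if present, as Explorer does)
This already allows quite complex and nonsensical names. For example, these names are possible under these rules and are valid file names in Windows/Linux:
A...........ext
B -.- .-. -.-ext
Essentially, even with so few whitelisted characters you should still decide what actually makes sense, and validate or adjust the name accordingly. In one of my applications, I used the same rules as above but stripped any duplicate dots and spaces.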
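The whitelist rules above, including the extra step of collapsing repeated dots and spaces, might be sketched in Python as follows. The function name normalize_name and the choice to return None for rejected names are assumptions made for this example, and the sketch covers only the ASCII whitelist, not the optional Unicode letters.

```python
import re
from typing import Optional

# Whitelist from the rules above: ASCII letters, digits, underscore,
# hyphen, space and dot.
_ALLOWED = re.compile(r"[A-Za-z0-9_\- .]*")


def normalize_name(name: str) -> Optional[str]:
    """Return a cleaned-up name, or None if it breaks the whitelist rules."""
    if not _ALLOWED.fullmatch(name):
        return None                      # contains a non-whitelisted character
    name = re.sub(r"\.{2,}", ".", name)  # collapse runs of dots
    name = re.sub(r" {2,}", " ", name)   # collapse runs of spaces
    name = name.rstrip(" .")             # trim trailing dots/spaces
    if not name or not name[0].isalnum():
        return None                      # must start with a letter or digit
    return name
```

Under these assumptions "A...........ext" becomes "A.ext", while ".hidden" and "a*b" are rejected. Reserved names such as CON still need a separate check.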
The forbidden printable ASCII characters are:

Linux/Unix:
- / (forward slash)

Windows:
- < (less than)
- > (greater than)
- : (colon - sometimes works, but is actually NTFS Alternate Data Streams)
- " (double quote)
- / (forward slash)
- \ (backslash)
- | (vertical bar or pipe)
- ? (question mark)
- * (asterisk)

Non-printable characters

If your data comes from a source that would permit non-printable characters then there is more to check for.

Linux/Unix:
- 0 (NULL byte)

Windows:
- 0-31 (ASCII control characters)

Note: While it is legal under Linux/Unix file systems to create files with control characters in the filename, it might be a nightmare for the users to deal with such files.

Reserved file names

The following filenames are reserved:

Windows:
- CON, PRN, AUX, NUL
- COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9
- LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9
(both on their own and with arbitrary file extensions, e.g. LPT1.txt).

Other rules

Windows: Filenames cannot end in a space or dot.

macOS: You didn't ask for it, but just in case: Colon : and forward slash / depending on context are not permitted (e.g. Finder supports slashes, terminal supports colons). (More details)
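The reserved-name list above can be turned into a small check. This Python sketch is only an illustration: the function name is_windows_reserved is made up, and it assumes that everything before the first dot counts as the base name and that the comparison is case-insensitive.

```python
# Reserved Windows device names, taken from the list above.
_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def is_windows_reserved(name: str) -> bool:
    """Return True if name is a reserved device name, with or without extension."""
    base = name.split(".", 1)[0]  # text before the first dot
    return base.upper() in _RESERVED
```

With this, "LPT1.txt" and "con" are flagged, while "COM10" and "console.log" are not.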
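The easy way to get Windows to tell you the answer is to try to rename a file in Explorer and type a slash, /, as the new name. Windows will pop up a message box listing the illegal characters: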
A filename cannot contain any of the following characters:
\ / : * ? " < > |
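Microsoft Docs - Naming Files, Paths, and Namespaces - Naming Conventions
I had the same need and was looking for recommendations or standard references when I came across this thread. My current blacklist of characters that should be avoided in file and directory names is: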
$CharactersInvalidForFileName = {
"pound" -> "#",
"left angle bracket" -> "<",
"dollar sign" -> "$",
"plus sign" -> "+",
"percent" -> "%",
"right angle bracket" -> ">",
"exclamation point" -> "!",
"backtick" -> "`",
"ampersand" -> "&",
"asterisk" -> "*",
"single quotes" -> "'",
"pipe" -> "|",
"left bracket" -> "{",
"question mark" -> "?",
"double quotes" -> "\"",
"equal sign" -> "=",
"right bracket" -> "}",
"forward slash" -> "/",
"colon" -> ":",
"back slash" -> "\\",
"blank spaces" -> " ",
"at sign" -> "@"
};
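While the only illegal Unix characters might be / and NULL, some consideration should be given to command-line interpretation.
For example, while it is legal to name a file 1>&2 or 2>&1 in Unix, file names like these might be misinterpreted when used on a command line.
Similarly, it is possible to name a file $PATH, but when you try to access it from the command line, the shell will translate $PATH to the variable's value.
In Unix shells, you can quote almost every character with single quotes '. The exceptions are the single quote itself, and control characters, which you cannot express because \ is not expanded. Reaching the single quote itself from within a quoted string is possible, because you can concatenate single-quoted and double-quoted strings, as in 'I'"'"'m', which can be used to access a file called "I'm" (double quotes alone would also work here).
So you should avoid all control characters, because they are too difficult to enter in the shell. The rest is still fun, especially files starting with a dash, because most commands read those as options unless you put two dashes -- before them, or you specify them with ./, which also hides the leading -.
If you want to be nice, don't use any of the characters that the shell and typical commands use as syntactic elements. Some of these are position dependent: you can still use -, for example, but not as the first character; the same goes for ., which should be the first character only when you mean it ("hidden file"). If you want to be mean, your file names are VT100 escape sequences ;-), so that an ls garbles the output.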
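When a program has to hand such names to a shell, the quoting rules above do not need to be implemented by hand. This minimal Python sketch uses the standard library's shlex.quote; the wrapper name shell_safe_arg and the choice of the ./ prefix for names starting with a dash are assumptions made for the example.

```python
import shlex


def shell_safe_arg(filename: str) -> str:
    """Quote a file name for use as one argument in a POSIX shell command.

    Hypothetical helper: a leading '-' is hidden behind './' so that
    commands do not mistake the name for an option.
    """
    if filename.startswith("-"):
        filename = "./" + filename
    return shlex.quote(filename)
```

For a file called I'm this yields the same 'I'"'"'m' concatenation shown above, and $PATH comes back single-quoted so the shell does not expand it.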
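As of 18/04/2017, no simple black or white list of characters and filenames is evident among the answers to this topic, and there are many replies.
The best suggestion I could come up with is to let the user name the file however they like. When the application tries to save the file, use an error handler to catch any exceptions, assume the filename is to blame (after making sure the save path is OK as well, obviously), and prompt the user for a new file name. For best results, put this check in a loop that continues until the user either gets it right or gives up. This worked best for me (at least in VBA).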
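The try-and-retry approach above might look like this in Python. It is a sketch only: the function name create_with_first_valid_name is made up, and the list of candidate names stands in for repeatedly prompting the user. As noted elsewhere in this thread, the caught errors can also come from permissions or a full disk, not just from a bad name.

```python
import os
from typing import Iterable, Optional


def create_with_first_valid_name(directory: str,
                                 candidate_names: Iterable[str]) -> Optional[str]:
    """Try each candidate in turn; return the first name that could be created.

    Returns None if every candidate fails (the user "gives up").
    """
    for name in candidate_names:
        path = os.path.join(directory, name)
        try:
            # Mode "x" creates the file and fails if it already exists.
            with open(path, "x"):
                pass
        except (OSError, ValueError):
            # OSError: rejected by the OS; ValueError: e.g. embedded NUL byte.
            continue
        return name
    return None
```

In an interactive program, the candidate list would be replaced by a loop that asks the user for another name after each failure.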
For Windows, you can check it using PowerShell:
$PathInvalidChars = [System.IO.Path]::GetInvalidPathChars() #36 chars
To display the UTF-8 codes, you can convert them:
$enc = [system.Text.Encoding]::UTF8
$PathInvalidChars | foreach { $enc.GetBytes($_) }
$FileNameInvalidChars = [System.IO.Path]::GetInvalidFileNameChars() #41 chars
$FileOnlyInvalidChars = @(':', '*', '?', '\', '/') #5 chars - as a difference
Here is a C# implementation for Windows based on Christopher Oezbek's answer.
The containsFolder boolean made it more complex, but it hopefully covers everything.
using System.Text;
using System.Text.RegularExpressions;

/// <summary>
/// This will replace invalid chars with underscores, there are also some reserved words that it adds underscore to
/// </summary>
/// <remarks>
/// https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names
/// </remarks>
/// <param name="containsFolder">Pass in true if filename represents a folder\file (passing true will allow slash)</param>
public static string EscapeFilename_Windows(string filename, bool containsFolder = false)
{
StringBuilder builder = new StringBuilder(filename.Length + 12);
int index = 0;
// Allow colon if it's part of the drive letter
if (containsFolder)
{
Match match = Regex.Match(filename, @"^\s*[A-Z]:\\", RegexOptions.IgnoreCase);
if (match.Success)
{
builder.Append(match.Value);
index = match.Length;
}
}
// Character substitutions
for (int cntr = index; cntr < filename.Length; cntr++)
{
char c = filename[cntr];
switch (c)
{
case '\u0000':
case '\u0001':
case '\u0002':
case '\u0003':
case '\u0004':
case '\u0005':
case '\u0006':
case '\u0007':
case '\u0008':
case '\u0009':
case '\u000A':
case '\u000B':
case '\u000C':
case '\u000D':
case '\u000E':
case '\u000F':
case '\u0010':
case '\u0011':
case '\u0012':
case '\u0013':
case '\u0014':
case '\u0015':
case '\u0016':
case '\u0017':
case '\u0018':
case '\u0019':
case '\u001A':
case '\u001B':
case '\u001C':
case '\u001D':
case '\u001E':
case '\u001F':
case '<':
case '>':
case ':':
case '"':
case '/':
case '|':
case '?':
case '*':
builder.Append('_');
break;
case '\\':
builder.Append(containsFolder ? c : '_');
break;
default:
builder.Append(c);
break;
}
}
string built = builder.ToString();
if (built == "")
{
return "_";
}
if (built.EndsWith(" ") || built.EndsWith("."))
{
built = built.Substring(0, built.Length - 1) + "_";
}
// These are reserved names, in either the folder or file name, but they are fine if following a dot
// CON, PRN, AUX, NUL, COM0 .. COM9, LPT0 .. LPT9
builder = new StringBuilder(built.Length + 12);
index = 0;
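// The trailing delimiter is a lookahead so that a backslash can also start the next match (e.g. CON\PRN)
foreach (Match match in Regex.Matches(built, @"(^|\\)\s*(?<bad>CON|PRN|AUX|NUL|COM\d|LPT\d)\s*(?=\.|\\|$)", RegexOptions.IgnoreCase))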
{
Group group = match.Groups["bad"];
if (group.Index > index)
{
// Copy everything between the previous keyword and this one, including any whitespace
builder.Append(built.Substring(index, group.Index - index));
}
builder.Append(group.Value);
builder.Append("_"); // putting an underscore after this keyword is enough to make it acceptable
index = group.Index + group.Length;
}
if (index == 0)
{
return built;
}
if (index < built.Length) // append the remainder, even if it is a single character
{
builder.Append(built.Substring(index));
}
return builder.ToString();
}
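Discussing different possible approaches
The difficulties of defining what is legal and what is not have already been addressed, and whitelists have been suggested. But not only Windows, many unixoid OSes also support more-than-8-bit characters such as Unicode. You could also talk about encodings here, such as UTF-8. You can consider Jonathan Leffler's comment, where he gives information about modern Linux and describes details for MacOS. Wikipedia states that (for example) the
modifier letter colon [(see 7. below) is] sometimes used in Windows filenames as it is identical to the colon in the Segoe UI font used for filenames. The [inherited ASCII] colon itself is not permitted.
Therefore, I want to present a much more liberal approach that uses Unicode homoglyph characters to replace the "illegal" ones. I found the result far more readable in my comparable use case. It is limited only by the font in use, which is very broad: 3903 characters for the Windows default. Plus, you can even restore the original content from the replacements.
Possible choices and research notes
To keep things organized, I will always give the character, its name and its hexadecimal number. The latter is not case-sensitive, and leading zeroes can be added or omitted freely, so for example U+002A and u+2a are equivalent. Where available, I'll try to point to more information or alternatives. Feel free to show me more or better ones.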
1. Instead of * (U+2A * ASTERISK), you can use one of the many listed, for example U+2217 ∗ (ASTERISK OPERATOR) or the Full Width Asterisk U+FF0A *. u+20f0 ⃰ combining asterisk above from combining diacritical marks for symbols might also be a valid choice. You can read 4. for more info about the combining characters.
2. Instead of . (U+2E . full stop), one of these could be a good option, for example ⋅ U+22C5 dot operator.
3. Instead of " (U+22 " quotation mark), you can use “ U+201C english leftdoublequotemark, more alternatives see here. I also included some of the good suggestions of Wally Brockway's answer, in this case u+2036 ‶ reversed double prime and u+2033 ″ double prime - I will from now on denote ideas from that source by ¹³.
4. Instead of / (U+2F / SOLIDUS), you can use ∕ DIVISION SLASH U+2215 (others here), ̸ U+0338 COMBINING LONG SOLIDUS OVERLAY, ̷ COMBINING SHORT SOLIDUS OVERLAY U+0337 or u+2044 ⁄ fraction slash¹³. Be aware about spacing for some characters, including the combining or overlay ones, as they have no width and can produce something like -> ̸th̷is which is ̸th̷is. With added spaces you get -> ̸ th ̷ is, which is ̸ th ̷ is. The second one (COMBINING SHORT SOLIDUS OVERLAY) looks bad in the stackoverflow-font.
5. Instead of \ (U+5C Reverse solidus), you can use ⧵ U+29F5 Reverse solidus operator (more) or u+20E5 ⃥ combining reverse solidus overlay¹³.
6. To replace [ (U+5B [ Left square bracket) and ] (U+005D ] Right square bracket), you can use for example U+FF3B [ FULLWIDTH LEFT SQUARE BRACKET and U+FF3D ] FULLWIDTH RIGHT SQUARE BRACKET (from here, more possibilities here).
7. Instead of : (u+3a : colon), you can use U+2236 ∶ RATIO (for mathematical usage) or U+A789 ꞉ MODIFIER LETTER COLON, (see colon (letter), sometimes used in Windows filenames as it is identical to the colon in the Segoe UI font used for filenames. The colon itself is not permitted ... source and more replacements see here). Another alternative is this one: u+1361 ፡ ethiopic wordspace¹³.
8. Instead of ; (u+3b ; semicolon), you can use U+037E ; GREEK QUESTION MARK (see here).
9. For | (u+7c | vertical line), there are some good substitutes such as: U+2223 ∣ DIVIDES, U+0964 । DEVANAGARI DANDA, U+01C0 ǀ LATIN LETTER DENTAL CLICK (the last ones from Wikipedia) or U+2D4F ⵏ Tifinagh Letter Yan. Also the box drawing characters contain various other options.
10. Instead of , (, U+002C COMMA), you can use for example ‚ U+201A SINGLE LOW-9 QUOTATION MARK (see here).
11. For ? (U+003F ? QUESTION MARK), these are good candidates: U+FF1F ? FULLWIDTH QUESTION MARK or U+FE56 ﹖ SMALL QUESTION MARK (from here and here). There are also two more from the Dingbats Block (search for "question") and the u+203d ‽ interrobang¹³.
12. While my machine seems to accept it unchanged, I still want to include > (u+3e greater-than sign) and < (u+3c less-than sign) for the sake of completeness. The best replacement here is probably also from the quotation block, such as u+203a › single right-pointing angle quotation mark and u+2039 ‹ single left-pointing angle quotation mark respectively. The tifinagh block only contains ⵦ (u+2D66)¹³ to replace <. The last notion is ⋖ less-than with dot u+22D6 and ⋗ greater-than with dot u+22D7.
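For more ideas, you can also look through this block, for example. Still want more? You can try drawing the character you are after and look at the suggestions here.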
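To make the substitution idea concrete, here is a small Python sketch of a reversible homoglyph mapping. The particular replacements are just one selection from the candidates discussed above, and the names to_homoglyphs and from_homoglyphs are made up for this example. The round trip is only exact if the original name does not already contain one of the replacement characters.

```python
# One possible choice of look-alike replacements (an assumption, not a standard).
_TO_HOMOGLYPH = {
    "*": "\u2217",   # ASTERISK OPERATOR
    '"': "\u201c",   # LEFT DOUBLE QUOTATION MARK
    "/": "\u2215",   # DIVISION SLASH
    "\\": "\u29f5",  # REVERSE SOLIDUS OPERATOR
    ":": "\ua789",   # MODIFIER LETTER COLON
    "|": "\u2223",   # DIVIDES
    "?": "\uff1f",   # FULLWIDTH QUESTION MARK
    "<": "\u2039",   # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    ">": "\u203a",   # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
}

_ENCODE = str.maketrans(_TO_HOMOGLYPH)
_DECODE = str.maketrans({v: k for k, v in _TO_HOMOGLYPH.items()})


def to_homoglyphs(name: str) -> str:
    """Replace characters that Windows forbids with look-alike characters."""
    return name.translate(_ENCODE)


def from_homoglyphs(name: str) -> str:
    """Restore the original characters from their look-alike replacements."""
    return name.translate(_DECODE)
```

Because the mapping is one-to-one, the original text can be recovered from the substituted name, which is the restoration property mentioned above.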
How to type these characters
Say you want to type ⵏ (Tifinagh Letter Yan). To get all of its information, you can always search for this character (ⵏ) on a suitable platform such as this Unicode Lookup (please add 0x when you search for hex) or that Unicode Table (which only allows searching by name, in this case "Tifinagh Letter Yan"). You should obtain its Unicode number U+2D4F and the HTML code &#11599; (note that 2D4F is hexadecimal for 11599). With this knowledge, you have several options for producing these special characters, including:
- A code points to Unicode converter, or again the Unicode Lookup, to convert the numerical representation back into the Unicode character (remember to set the code point base below to decimal or hexadecimal respectively).
- A one-liner macro in Autohotkey: :?*:altpipe::{U+2D4F} to type ⵏ instead of the string altpipe. This is the way I input those special characters; my Autohotkey script can be shared if there is common interest.
- Alt characters or alt-codes: press and hold Alt, followed by the decimal number of the desired character (more info for example here, look at a table here or there). For the example, that would be Alt+11599. Be aware that many programs do not fully support this Windows feature for all of Unicode (as of the time of writing). Microsoft Office is an exception where it usually works, and some other OSes provide similar functionality. Typing these characters with Alt combinations into MS Word is also the way Wally Brockway suggests in his answer¹³ that was already mentioned. If you don't want to convert all the hexadecimal values to decimal, you can find some of them there¹³.
- In MS Office, you can also use ALT + X as described in this MS article to produce the characters.
- If you rarely need it, you can of course still just copy and paste the special character of your choice instead of typing it.
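The .NET Framework's System.IO provides the following functions for invalid file system characters:
- Path.GetInvalidFileNameChars
- Path.GetInvalidPathChars
Those functions should return appropriate results depending on the platform the .NET runtime is running on. That said, the Remarks on the documentation pages for those functions say:
The array returned from this method is not guaranteed to contain the complete set of characters that are invalid in file and directory names. The full set of invalid characters can vary by file system.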
I always assumed that banned characters in Windows filenames meant that all exotic characters would also be outlawed. The inability to use ?, / and : in particular irked me. One day I discovered that it was virtually only those chars which were banned. Other Unicode characters may be used. So the nearest Unicode characters to the banned ones I could find were identified and MS Word macros were made for them as Alt+?, Alt+: etc. Now I form the filename in Word, using the substitute chars, and copy it to the Windows filename. So far I have had no problems.
Here are the substitute characters (Alt + the decimal Unicode):
⃰ ⇔ Alt8432
⁄ ⇔ Alt8260
⃥ ⇔ Alt8421
∣ ⇔ Alt8739
ⵦ ⇔ Alt11622
⮚ ⇔ Alt11162
‽ ⇔ Alt8253
፡ ⇔ Alt4961
‶ ⇔ Alt8246
″ ⇔ Alt8243
As a test, I formed a filename using all of those characters and Windows accepted it.
This is good enough for me in Python:
import re


def fix_filename(name, max_length=255):
    """
    Replace invalid characters on Linux/Windows/MacOS with underscores.
    List from https://stackoverflow.com/a/31976060/819417
    Trailing spaces & periods are ignored on Windows.
    >>> fix_filename("  COM1  ")
    '_ COM1 _'
    >>> fix_filename("COM10")
    'COM10'
    >>> fix_filename("COM1,")
    'COM1,'
    >>> fix_filename("COM1.txt")
    '_.txt'
    >>> all('_' == fix_filename(chr(i)) for i in list(range(32)))
    True
    """
    # Forbidden characters, reserved device names at the start of the name,
    # a leading whitespace character, and a trailing whitespace or dot.
    return re.sub(r'[/\\:|<>"?*\0-\x1f]|^(AUX|COM[1-9]|CON|LPT[1-9]|NUL|PRN)(?![^.])|^\s|[\s.]$', "_", name[:max_length], flags=re.IGNORECASE)
See also this outdated list for additional legacy stuff like = in FAT32.