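I want to remove all special characters from a string. The allowed characters are A-Z (uppercase or lowercase), digits (0-9), underscore (_), and dot (.).
I have the following. It works, but I suspect (I know!) it is not very efficient: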
public static string RemoveSpecialCharacters(string str)
{
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < str.Length; i++)
    {
        // Note: the range 'A'..'z' also admits the six characters between 'Z' and 'a' ([ \ ] ^ _ `).
        if ((str[i] >= '0' && str[i] <= '9')
            || (str[i] >= 'A' && str[i] <= 'z'
                || (str[i] == '.' || str[i] == '_')))
        {
            sb.Append(str[i]);
        }
    }
    return sb.ToString();
}
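What is the most efficient way to do this? What would a regular expression look like, and how does it compare with plain string manipulation?
The strings to be cleaned are rather short, usually between 10 and 30 characters long.
The following code produces the output below. The conclusion is that we can also save some memory by allocating a smaller array, since the highest index used is 122: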
// 123 slots cover indices 0..122; 'z' is 122.
var lookup = new bool[123];
for (var c = '0'; c <= '9'; c++)
{
    lookup[c] = true;
    System.Diagnostics.Debug.WriteLine((int)c + ": " + (char)c);
}
for (var c = 'A'; c <= 'Z'; c++)
{
    lookup[c] = true;
    System.Diagnostics.Debug.WriteLine((int)c + ": " + (char)c);
}
for (var c = 'a'; c <= 'z'; c++)
{
    lookup[c] = true;
    System.Diagnostics.Debug.WriteLine((int)c + ": " + (char)c);
}
48: 0
49: 1
50: 2
51: 3
52: 4
53: 5
54: 6
55: 7
56: 8
57: 9
65: A
66: B
67: C
68: D
69: E
70: F
71: G
72: H
73: I
74: J
75: K
76: L
77: M
78: N
79: O
80: P
81: Q
82: R
83: S
84: T
85: U
86: V
87: W
88: X
89: Y
90: Z
97: a
98: b
99: c
100: d
101: e
102: f
103: g
104: h
105: i
106: j
107: k
108: l
109: m
110: n
111: o
112: p
113: q
114: r
115: s
116: t
117: u
118: v
119: w
120: x
121: y
122: z
You can also add the following lines to support the Russian locale (the array size then becomes 1104):
for (var c = 'А'; c <= 'Я'; c++)
{
    lookup[c] = true;
    System.Diagnostics.Debug.WriteLine((int)c + ": " + (char)c);
}
for (var c = 'а'; c <= 'я'; c++)
{
    lookup[c] = true;
    System.Diagnostics.Debug.WriteLine((int)c + ": " + (char)c);
}
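One detail the smaller table implies is that any character above the last index has to be rejected before indexing, otherwise the lookup throws. Below is a minimal sketch of how the 123-element table might be used. The class name `SmallLookupSanitizer` and the helper `BuildLookup` are hypothetical names chosen for illustration, and the explicit bounds check is an assumption about how the table would be consumed.

```csharp
using System;

public static class SmallLookupSanitizer
{
    // 123 slots cover indices 0..122 ('z' is 122); '.' (46) and '_' (95) fit as well.
    private static readonly bool[] Lookup = BuildLookup();

    private static bool[] BuildLookup()
    {
        var lookup = new bool[123];
        for (var c = '0'; c <= '9'; c++) lookup[c] = true;
        for (var c = 'A'; c <= 'Z'; c++) lookup[c] = true;
        for (var c = 'a'; c <= 'z'; c++) lookup[c] = true;
        lookup['.'] = true;
        lookup['_'] = true;
        return lookup;
    }

    public static string RemoveSpecialCharacters(string str)
    {
        var buffer = new char[str.Length];
        var index = 0;
        foreach (var c in str)
        {
            // Guard first: a character above 'z' would index past the end of the small table.
            if (c < Lookup.Length && Lookup[c])
            {
                buffer[index] = c;
                index++;
            }
        }
        return new string(buffer, 0, index);
    }
}
```

The trade-off is one extra comparison per character in exchange for a table of 123 entries instead of 65536.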
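Why do you think your method is not efficient? It is actually one of the most efficient ways to do it.
You should of course read the character into a local variable, or use an enumerator, to reduce the number of array accesses: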
public static string RemoveSpecialCharacters(this string str) {
    StringBuilder sb = new StringBuilder();
    foreach (char c in str) {
        if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_') {
            sb.Append(c);
        }
    }
    return sb.ToString();
}
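One thing that makes a method like this efficient is that it scales well. The execution time grows in proportion to the length of the string, so there are no nasty surprises if you use it on a large string.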
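Edit:
I made a quick performance test, running each function a million times with a 24-character string. These are the results:
Original function: 54.5 ms.
My suggested change: 47.1 ms.
Mine with the StringBuilder capacity set: 43.3 ms.
Regular expression: 294.4 ms.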
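The code for the last two timed variants is not shown here, so the following is only a sketch of what they might look like. The class name `SanitizerVariants`, the method names, and the exact pattern `[^a-zA-Z0-9_.]` are assumptions; the regular expression that was actually timed may have differed.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class SanitizerVariants
{
    // Built once and reused; the pattern matches any single character outside the allowed set.
    private static readonly Regex Disallowed =
        new Regex("[^a-zA-Z0-9_.]", RegexOptions.Compiled);

    // Loop variant with the StringBuilder capacity set up front.
    public static string WithCapacity(string str)
    {
        // The result is never longer than the input, so str.Length is a safe upper bound.
        var sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // Regular-expression variant: replace every disallowed character with nothing.
    public static string WithRegex(string str)
    {
        return Disallowed.Replace(str, string.Empty);
    }
}
```

Both methods should return the same result for the same input, for example `SanitizerVariants.WithRegex("a-b_c.1!")` and `SanitizerVariants.WithCapacity("a-b_c.1!")` each give `"ab_c.1"`.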
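Edit 2:
I added the distinction between A-Z and a-z in the code above. (I reran the performance test, and there is no noticeable difference.)
Edit 3:
I tested the lookup+char[] solution, and it runs in about 13 milliseconds.
The price to pay is, of course, the initialization of the huge lookup table and keeping it in memory. Well, it's not that much data, but it's a lot for such a trivial function...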
private static bool[] _lookup;
static Program() {
    _lookup = new bool[65536];
    for (char c = '0'; c <= '9'; c++) _lookup[c] = true;
    for (char c = 'A'; c <= 'Z'; c++) _lookup[c] = true;
    for (char c = 'a'; c <= 'z'; c++) _lookup[c] = true;
    _lookup['.'] = true;
    _lookup['_'] = true;
}
public static string RemoveSpecialCharacters(string str) {
    char[] buffer = new char[str.Length];
    int index = 0;
    foreach (char c in str) {
        if (_lookup[c]) {
            buffer[index] = c;
            index++;
        }
    }
    return new string(buffer, 0, index);
}
I had to do something similar for work, but in my case I had to filter out everything that is not a letter, digit, or whitespace (you could easily modify it to your needs).
The filtering is done client-side in JavaScript, but for security reasons I also do the filtering server-side. Since I can expect most of the strings to be clean, I would like to avoid copying the string unless I really need to. This led me to the implementation below, which should perform better for both clean and dirty strings.
public static string EnsureOnlyLetterDigitOrWhiteSpace(string input)
{
    StringBuilder cleanedInput = null;
    for (var i = 0; i < input.Length; ++i)
    {
        var currentChar = input[i];
        var charIsValid = char.IsLetterOrDigit(currentChar) || char.IsWhiteSpace(currentChar);
        if (charIsValid)
        {
            if (cleanedInput != null)
                cleanedInput.Append(currentChar);
        }
        else
        {
            if (cleanedInput != null) continue;
            cleanedInput = new StringBuilder();
            if (i > 0)
                cleanedInput.Append(input.Substring(0, i));
        }
    }
    return cleanedInput == null ? input : cleanedInput.ToString();
}
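The same copy-only-when-dirty idea can be adapted to the character set from the question. The sketch below is one possible adaptation, not the implementation above: the class name `LazyCopySanitizer` and the helper `IsAllowed` are hypothetical, and it uses `StringBuilder.Append(string, int, int)` to copy the clean prefix without an intermediate `Substring`.

```csharp
using System.Text;

public static class LazyCopySanitizer
{
    // Allowed set from the question: A-Z, a-z, 0-9, '.' and '_'.
    private static bool IsAllowed(char c)
    {
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_';
    }

    public static string RemoveSpecialCharacters(string input)
    {
        // Stays null until the first disallowed character is seen.
        StringBuilder cleaned = null;
        for (var i = 0; i < input.Length; i++)
        {
            var c = input[i];
            if (IsAllowed(c))
            {
                if (cleaned != null)
                {
                    cleaned.Append(c);
                }
            }
            else if (cleaned == null)
            {
                // First disallowed character: start the copy with the clean prefix before it.
                cleaned = new StringBuilder(input.Length);
                cleaned.Append(input, 0, i);
            }
        }
        // A clean input is returned as the same instance, with no allocation.
        return cleaned == null ? input : cleaned.ToString();
    }
}
```

For a clean input such as `"file_name.txt"` the method returns the original string object, which can be checked with `object.ReferenceEquals`.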