\d is less efficient than [0-9]

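Yesterday I commented on an answer where someone had used [0123456789] in a regular expression rather than [0-9] or \d. I said that a range or the digit specifier was probably more efficient than a character set.

I decided to test that today and found, to my surprise, that (in the C# regex engine at least) \d appears to be less efficient than either of the other two, which don't differ much from each other. Here is my test output over 10000 random strings of 1000 random characters each, 5077 of which actually contain a digit: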

Regex \d           took 00:00:00.2141226 result: 5077/10000
Regex [0-9]        took 00:00:00.1357972 result: 5077/10000  63.42 % of first
Regex [0123456789] took 00:00:00.1388997 result: 5077/10000  64.87 % of first

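This surprises me for two reasons, and I would be interested if anyone can shed some light on them:

1. I would have thought the range would be implemented much more efficiently than the set.
2. I can't understand why \d is worse than [0-9]. Is there more to \d than simply shorthand for [0-9]?

Here is the test code: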

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace SO_RegexPerformance
{
    class Program
    {
        static void Main(string[] args)
        {
            var rand = new Random(1234);
            var strings = new List<string>();
            //10K random strings
            for (var i = 0; i < 10000; i++)
            {
                //generate random string
                var sb = new StringBuilder();
                for (var c = 0; c < 1000; c++)
                {
                    //add a-z randomly
                    sb.Append((char)('a' + rand.Next(26)));
                }
                //in roughly 50% of them, put a digit
                if (rand.Next(2) == 0)
                {
                    //replace 1 char with a digit 0-9
                    sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                }
                strings.Add(sb.ToString());
            }

            var baseTime = testPerfomance(strings, @"\d");
            Console.WriteLine();
            var testTime = testPerfomance(strings, "[0-9]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            testTime = testPerfomance(strings, "[0123456789]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
        }

        private static TimeSpan testPerfomance(List<string> strings, string regex)
        {
            var sw = new Stopwatch();

            int successes = 0;

            var rex = new Regex(regex);

            sw.Start();
            foreach (var str in strings)
            {
                if (rex.Match(str).Success)
                {
                    successes++;
                }
            }
            sw.Stop();

            Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);

            return sw.Elapsed;
        }
    }
}

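Current answer

From "Does \d in regex mean a digit?":

[0-9] isn't equivalent to \d. [0-9] matches only the characters 0123456789, while \d matches [0-9] and other digit characters, for example Eastern Arabic numerals ٠١٢٣٤٥٦٧٨٩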
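
The difference can be checked directly with a minimal sketch. The class name DigitClassDemo and the helper Matches are made up for illustration; only Regex.IsMatch comes from the .NET library.

```csharp
using System;
using System.Text.RegularExpressions;

static class DigitClassDemo
{
    // Hypothetical helper: true when the pattern matches anywhere in the input.
    public static bool Matches(string input, string pattern) =>
        Regex.IsMatch(input, pattern);

    static void Main()
    {
        const string easternArabicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE

        Console.WriteLine(Matches("3", @"\d"));                  // ASCII digit against \d
        Console.WriteLine(Matches("3", "[0-9]"));                // ASCII digit against [0-9]
        Console.WriteLine(Matches(easternArabicThree, @"\d"));   // non-ASCII digit against \d
        Console.WriteLine(Matches(easternArabicThree, "[0-9]")); // non-ASCII digit against [0-9]
    }
}
```

With default options, the first three lines should print True and the last one False, since only \d accepts the non-ASCII digit.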

2013-05-18 07:27:14

Other answers

Credit to ByteBlast for noticing this in the docs. Just changing the regex constructor:

var rex = new Regex(regex, RegexOptions.ECMAScript);

gives new timings:

Regex \d           took 00:00:00.1355787 result: 5077/10000
Regex [0-9]        took 00:00:00.1360403 result: 5077/10000  100.34 % of first
Regex [0123456789] took 00:00:00.1362112 result: 5077/10000  100.47 % of first
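
The timings converge presumably because, under RegexOptions.ECMAScript, \d is restricted to the same ten characters as [0-9]. A small sketch of that behavior follows; the class EcmaScriptDigitDemo and the helper MatchesDigit are illustrative names, not part of the benchmark above.

```csharp
using System;
using System.Text.RegularExpressions;

static class EcmaScriptDigitDemo
{
    // Hypothetical helper: tests the input against \d under the given options.
    public static bool MatchesDigit(string input, RegexOptions options) =>
        new Regex(@"\d", options).IsMatch(input);

    static void Main()
    {
        const string persianFour = "\u06F4"; // EXTENDED ARABIC-INDIC DIGIT FOUR

        Console.WriteLine(MatchesDigit(persianFour, RegexOptions.None));       // default: Unicode-aware \d
        Console.WriteLine(MatchesDigit(persianFour, RegexOptions.ECMAScript)); // ECMAScript: \d acts like [0-9]
        Console.WriteLine(MatchesDigit("4", RegexOptions.ECMAScript));         // ASCII digit still matches
    }
}
```

This should print True, False, True: the Persian digit matches only when the ECMAScript option is off.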

2013-05-18 09:37:17

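\d checks all Unicode digits, while [0-9] is limited to these 10 characters. For example, the Persian digits ۱۲۳۴۵۶۷۸۹ are Unicode digits that are matched by \d but not by [0-9].

You can generate a list of all such characters using the following code: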

var sb = new StringBuilder();
for(UInt16 i = 0; i < UInt16.MaxValue; i++)
{
    string str = Convert.ToChar(i).ToString();
    if (Regex.IsMatch(str, @"\d"))
        sb.Append(str);
}
Console.WriteLine(sb.ToString());

Which generates:

0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫૬૭૮૯୦୧୨୩୪୫୬୭୮୯௦௧௨௩௪௫௬௭௮௯౦౧౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯๐๑๒๓๔๕๖๗๘๙໐໑໒໓໔໕໖໗໘໙༠༡༢༣༤༥༦༧༨༩၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙០១២៣៤៥៦៧៨៩᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙０１２３４５６７８９

2013-05-18 07:24:11
