switch语句真的比if语句快吗?

我在Visual Studio 2010的x64 c++编译器上使用/Ox标志运行下面的代码:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define MAX_COUNT (1 << 29)
size_t counter = 0;

size_t testSwitch()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        switch (counter % 4 + 1)
        {
            case 1: counter += 4; break;
            case 2: counter += 3; break;
            case 3: counter += 2; break;
            case 4: counter += 1; break;
        }
    }
    return 1000 * (clock() - start) / CLOCKS_PER_SEC;
}

size_t testIf()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        const size_t c = counter % 4 + 1;
        if (c == 1) { counter += 4; }
        else if (c == 2) { counter += 3; }
        else if (c == 3) { counter += 2; }
        else if (c == 4) { counter += 1; }
    }
    return 1000 * (clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    printf("Starting...\n");
    printf("Switch statement: %u ms\n", testSwitch());
    printf("If     statement: %u ms\n", testIf());
}

并得到了这些结果:

Switch语句:5261 ms If语句:5196毫秒

据我所知,switch语句显然使用跳转表来优化分支。

问题:

在x86或x64中,基本的跳转表是什么样的? 这段代码是否使用了跳转表? 为什么在这个例子中没有性能差异?是否存在存在显著性能差异的情况?


代码的反汇编:

testIf:

13FE81B10 sub  rsp,48h 
13FE81B14 call qword ptr [__imp_clock (13FE81128h)] 
13FE81B1A mov  dword ptr [start],eax 
13FE81B1E mov  qword ptr [i],0 
13FE81B27 jmp  testIf+26h (13FE81B36h) 
13FE81B29 mov  rax,qword ptr [i] 
13FE81B2E inc  rax  
13FE81B31 mov  qword ptr [i],rax 
13FE81B36 cmp  qword ptr [i],20000000h 
13FE81B3F jae  testIf+0C3h (13FE81BD3h) 
13FE81B45 xor  edx,edx 
13FE81B47 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B4E mov  ecx,4 
13FE81B53 div  rax,rcx 
13FE81B56 mov  rax,rdx 
13FE81B59 inc  rax  
13FE81B5C mov  qword ptr [c],rax 
13FE81B61 cmp  qword ptr [c],1 
13FE81B67 jne  testIf+6Dh (13FE81B7Dh) 
13FE81B69 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B70 add  rax,4 
13FE81B74 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81B7B jmp  testIf+0BEh (13FE81BCEh) 
13FE81B7D cmp  qword ptr [c],2 
13FE81B83 jne  testIf+89h (13FE81B99h) 
13FE81B85 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B8C add  rax,3 
13FE81B90 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81B97 jmp  testIf+0BEh (13FE81BCEh) 
13FE81B99 cmp  qword ptr [c],3 
13FE81B9F jne  testIf+0A5h (13FE81BB5h) 
13FE81BA1 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81BA8 add  rax,2 
13FE81BAC mov  qword ptr [counter (13FE835D0h)],rax 
13FE81BB3 jmp  testIf+0BEh (13FE81BCEh) 
13FE81BB5 cmp  qword ptr [c],4 
13FE81BBB jne  testIf+0BEh (13FE81BCEh) 
13FE81BBD mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81BC4 inc  rax  
13FE81BC7 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81BCE jmp  testIf+19h (13FE81B29h) 
13FE81BD3 call qword ptr [__imp_clock (13FE81128h)] 
13FE81BD9 sub  eax,dword ptr [start] 
13FE81BDD imul eax,eax,3E8h 
13FE81BE3 cdq       
13FE81BE4 mov  ecx,3E8h 
13FE81BE9 idiv eax,ecx 
13FE81BEB cdqe      
13FE81BED add  rsp,48h 
13FE81BF1 ret       

testSwitch:

13FE81C00 sub  rsp,48h 
13FE81C04 call qword ptr [__imp_clock (13FE81128h)] 
13FE81C0A mov  dword ptr [start],eax 
13FE81C0E mov  qword ptr [i],0 
13FE81C17 jmp  testSwitch+26h (13FE81C26h) 
13FE81C19 mov  rax,qword ptr [i] 
13FE81C1E inc  rax  
13FE81C21 mov  qword ptr [i],rax 
13FE81C26 cmp  qword ptr [i],20000000h 
13FE81C2F jae  testSwitch+0C5h (13FE81CC5h) 
13FE81C35 xor  edx,edx 
13FE81C37 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C3E mov  ecx,4 
13FE81C43 div  rax,rcx 
13FE81C46 mov  rax,rdx 
13FE81C49 inc  rax  
13FE81C4C mov  qword ptr [rsp+30h],rax 
13FE81C51 cmp  qword ptr [rsp+30h],1 
13FE81C57 je   testSwitch+73h (13FE81C73h) 
13FE81C59 cmp  qword ptr [rsp+30h],2 
13FE81C5F je   testSwitch+87h (13FE81C87h) 
13FE81C61 cmp  qword ptr [rsp+30h],3 
13FE81C67 je   testSwitch+9Bh (13FE81C9Bh) 
13FE81C69 cmp  qword ptr [rsp+30h],4 
13FE81C6F je   testSwitch+0AFh (13FE81CAFh) 
13FE81C71 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C73 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C7A add  rax,4 
13FE81C7E mov  qword ptr [counter (13FE835D0h)],rax 
13FE81C85 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C87 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C8E add  rax,3 
13FE81C92 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81C99 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C9B mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81CA2 add  rax,2 
13FE81CA6 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81CAD jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81CAF mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81CB6 inc  rax  
13FE81CB9 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81CC0 jmp  testSwitch+19h (13FE81C19h) 
13FE81CC5 call qword ptr [__imp_clock (13FE81128h)] 
13FE81CCB sub  eax,dword ptr [start] 
13FE81CCF imul eax,eax,3E8h 
13FE81CD5 cdq       
13FE81CD6 mov  ecx,3E8h 
13FE81CDB idiv eax,ecx 
13FE81CDD cdqe      
13FE81CDF add  rsp,48h 
13FE81CE3 ret       

更新:

有趣的结果。但不知道为什么一个更快,一个更慢。


当前回答

以下是旧的(现在很难找到)bench++基准测试的一些结果:

Test Name:   F000003                         Class Name:  Style
CPU Time:       0.781  nanoseconds           plus or minus     0.0715
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 2-way if/else if statement
 compare this test with F000004

Test Name:   F000004                         Class Name:  Style
CPU Time:        1.53  nanoseconds           plus or minus     0.0767
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 2-way switch statement
 compare this test with F000003

Test Name:   F000005                         Class Name:  Style
CPU Time:        7.70  nanoseconds           plus or minus      0.385
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 10-way if/else if statement
 compare this test with F000006

Test Name:   F000006                         Class Name:  Style
CPU Time:        2.00  nanoseconds           plus or minus     0.0999
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 10-way switch statement
 compare this test with F000005

Test Name:   F000007                         Class Name:  Style
CPU Time:        3.41  nanoseconds           plus or minus      0.171
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 10-way sparse switch statement
 compare this test with F000005 and F000006

从中我们可以看到(在这台机器上,使用这个编译器——vc++ 9.0 x64),每个if测试大约需要0.7纳秒。随着测试数量的增加,时间几乎是完全线性的。

使用switch语句,2路测试和10路测试之间的速度几乎没有区别,只要值是密集的。使用稀疏值的10路测试所花费的时间大约是使用密集值的10路测试的1.6倍——但即使使用稀疏值,仍然比使用10路if/else if的速度快两倍。

底线:仅仅使用一个4-way测试并不能真正告诉你switch vs if/else的性能。如果您查看这段代码中的数字,就可以很容易地插入一个事实,即对于4路测试,我们期望两者产生非常相似的结果(If /else为~2.8纳秒,开关为~2.0纳秒)。

其他回答

我对此很感兴趣,并查看了可以对您的示例进行哪些更改以使其更快地运行switch语句。

如果你有40个If语句,并添加一个0大小写,那么If块将比等效的switch语句运行得慢。我在这里有结果:https://www.ideone.com/KZeCz。

删除0大小写的效果可以在这里看到:https://www.ideone.com/LFnrX。

编译器可以对交换机进行几种优化。我不认为经常提到的“跳跃表”是非常有用的,因为它只在输入可以以某种方式被限制时才有效。

C“跳转表”的伪代码是这样的——注意,实际上编译器需要在表周围插入某种形式的if测试,以确保输入在表中是有效的。还要注意,它只在输入是连续数字的特定情况下才有效。

如果交换机中的分支数量非常大,编译器可以对交换机的值使用二进制搜索,这(在我看来)将是一个更有用的优化,因为它在某些场景中显著提高了性能,与交换机一样通用,并且不会导致更大的生成代码大小。但是要看到这一点,您的测试代码将需要更多的分支来查看任何差异。

回答您的具体问题:

Clang generates one that looks like this: test_switch(char): # @test_switch(char) movl %edi, %eax cmpl $19, %edi jbe .LBB0_1 retq .LBB0_1: jmpq *.LJTI0_0(,%rax,8) jmp void call<0u>() # TAILCALL jmp void call<1u>() # TAILCALL jmp void call<2u>() # TAILCALL jmp void call<3u>() # TAILCALL jmp void call<4u>() # TAILCALL jmp void call<5u>() # TAILCALL jmp void call<6u>() # TAILCALL jmp void call<7u>() # TAILCALL jmp void call<8u>() # TAILCALL jmp void call<9u>() # TAILCALL jmp void call<10u>() # TAILCALL jmp void call<11u>() # TAILCALL jmp void call<12u>() # TAILCALL jmp void call<13u>() # TAILCALL jmp void call<14u>() # TAILCALL jmp void call<15u>() # TAILCALL jmp void call<16u>() # TAILCALL jmp void call<17u>() # TAILCALL jmp void call<18u>() # TAILCALL jmp void call<19u>() # TAILCALL .LJTI0_0: .quad .LBB0_2 .quad .LBB0_3 .quad .LBB0_4 .quad .LBB0_5 .quad .LBB0_6 .quad .LBB0_7 .quad .LBB0_8 .quad .LBB0_9 .quad .LBB0_10 .quad .LBB0_11 .quad .LBB0_12 .quad .LBB0_13 .quad .LBB0_14 .quad .LBB0_15 .quad .LBB0_16 .quad .LBB0_17 .quad .LBB0_18 .quad .LBB0_19 .quad .LBB0_20 .quad .LBB0_21 I can say that it is not using a jump table -- 4 comparison instructions are clearly visible: 13FE81C51 cmp qword ptr [rsp+30h],1 13FE81C57 je testSwitch+73h (13FE81C73h) 13FE81C59 cmp qword ptr [rsp+30h],2 13FE81C5F je testSwitch+87h (13FE81C87h) 13FE81C61 cmp qword ptr [rsp+30h],3 13FE81C67 je testSwitch+9Bh (13FE81C9Bh) 13FE81C69 cmp qword ptr [rsp+30h],4 13FE81C6F je testSwitch+0AFh (13FE81CAFh) A jump table based solution does not use comparison at all. Either not enough branches to cause the compiler to generate a jump table, or your compiler simply doesn't generate them. I'm not sure which.

EDIT 2014: There has been some discussion elsewhere from people familiar with the LLVM optimizer saying that the jump table optimization can be important in many scenarios; e.g. in cases where there is an enumeration with many values and many cases against values in said enumeration. That said, I stand by what I said above in 2011 -- too often I see people thinking "if I make it a switch, it'll be the same time no matter how many cases I have" -- and that's completely false. Even with a jump table you get the indirect jump cost and you pay for entries in the table for each case; and memory bandwidth is a Big Deal on modern hardware.

为可读性编写代码。任何有价值的编译器都会看到一个if / else if阶梯,并将其转换为等效的开关,反之亦然,如果这样做会更快的话。

I'll answer 2) and make some general comments. 2) No, there is no jump table in the assembly code you've posted. A jump table is a table of jump destinations, and one or two instructions to jump directly to an indexed location from the table. A jump table would make more sense when there are many possible switch destinations. Maybe the optimiser knows that simple if else logic is faster unless the number of destinations is greater than some threshold. Try your example again with say 20 possibilities instead of 4.

但不知道为什么一个更快,一个更慢。

这其实不难解释……如果你还记得错误预测的分支比正确预测的分支贵几十到几百倍。

In the % 20 version, the first case/if is always the one that hits. Modern CPUs "learn" which branches are usually taken and which are not, so they can easily predict how this branch will behave on almost every iteration of the loop. That explains why the "if" version flies; it never has to execute anything past the first test, and it (correctly) predicts the result of that test for most of the iterations. Obviously the "switch" is implemented slightly differently -- perhaps even a jump table, which can be slow thanks to the computed branch.

在% 21版本中,分支基本上是随机的。因此,不仅每次迭代都有许多线程执行,CPU也无法猜测它们将朝哪个方向运行。在这种情况下,跳转表(或其他“切换”优化)可能会有所帮助。

很难预测一段代码在现代编译器和CPU上的表现,而且每一代都会变得更加困难。最好的建议是:“连试都不用试;总是概要”。每年,这种建议都会变得越来越好,而能够成功忽略它的人也会越来越少。

所有这些都是说,我上面的解释在很大程度上是一种猜测。: -)

一个好的优化编译器,如MSVC可以生成:

一个简单的跳跃表,如果情况被安排在一个很好的长范围 如果有很多空隙,则使用稀疏(两级)跳转表 如果情况的数量很小或值为 不要靠得太近 以上的组合,如果情况代表几组 间隔太近。

简而言之,如果转换看起来比一系列if慢,编译器可能会直接将它转换为一个if。它很可能不仅仅是每个情况下的比较序列,而是一个二叉搜索树。请看这里的例子。