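Both loops are infinite, but we can see which one takes more instructions/resources per iteration.
Using gcc, I compiled the two following programs to assembly at varying levels of optimization: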
int main(void) {
while(1) {}
return 0;
}
int main(void) {
while(2) {}
return 0;
}
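Even with no optimizations (-O0), the generated assembly was identical for both programs. Therefore, there is no speed difference between the two loops.
For reference, here is the generated assembly (using gcc main.c -S -masm=intel with an optimization flag):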
With -O0:
.file "main.c"
.intel_syntax noprefix
.def __main; .scl 2; .type 32; .endef
.text
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
push rbp
.seh_pushreg rbp
mov rbp, rsp
.seh_setframe rbp, 0
sub rsp, 32
.seh_stackalloc 32
.seh_endprologue
call __main
.L2:
jmp .L2
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
With -O1:
.file "main.c"
.intel_syntax noprefix
.def __main; .scl 2; .type 32; .endef
.text
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
sub rsp, 40
.seh_stackalloc 40
.seh_endprologue
call __main
.L2:
jmp .L2
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
With -O2 and -O3 (same output):
.file "main.c"
.intel_syntax noprefix
.def __main; .scl 2; .type 32; .endef
.section .text.startup,"x"
.p2align 4,,15
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
sub rsp, 40
.seh_stackalloc 40
.seh_endprologue
call __main
.L2:
jmp .L2
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
In fact, the assembly generated for the loop is identical at every optimization level:
.L2:
jmp .L2
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
The important part is:
.L2:
jmp .L2
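I can't read assembly very well, but this is obviously an unconditional loop. The jmp instruction unconditionally sends the program back to the .L2 label without even comparing a value against true, and of course it immediately does so again until the program is somehow ended. This directly corresponds to the C/C++ code: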
L2:
goto L2;
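To make the equivalence concrete at the source level, here is a minimal sketch with three bounded variants of the same loop. The function names and the `limit` escape condition are hypothetical additions (the original loops never terminate); they exist only so the behavior can be observed.

```c
#include <assert.h>

/* Both loop conditions are nonzero constants, so both are "true" in C. */
static_assert(1 && 2, "1 and 2 are both nonzero");

/* Loop written with while(1); the limit check is an added escape hatch. */
unsigned spin_while_1(unsigned limit) {
    unsigned n = 0;
    while (1) {
        if (n == limit) break;
        n++;
    }
    return n;
}

/* Same loop written with while(2). */
unsigned spin_while_2(unsigned limit) {
    unsigned n = 0;
    while (2) {
        if (n == limit) break;
        n++;
    }
    return n;
}

/* Same loop written as a label plus an unconditional goto. */
unsigned spin_goto(unsigned limit) {
    unsigned n = 0;
L2:
    if (n == limit) return n;
    n++;
    goto L2;
}
```

All three functions run the body the same number of times for a given `limit`; the only thing that differs is how the unconditional jump is spelled in the source.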
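Edit:
Interestingly enough, even with no optimizations, the following loops all produced the exact same output (an unconditional jmp) in assembly: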
while(42) {}
while(1==1) {}
while(2==2) {}
while(4<7) {}
while(3==3 && 4==4) {}
while(8-9 < 0) {}
while(4.3 * 3e4 >= 2 << 6) {}
while(-0.1 + 02) {}
And even, to my amazement:
#include<math.h>
while(sqrt(7)) {}
while(hypot(3,4)) {}
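The reason these conditions can fold into an unconditional jump is that each one is a constant that evaluates to a nonzero value. The sketch below checks that for the arithmetic conditions listed above, assuming a C11 compiler; the `math.h` calls are left out to avoid assumptions about linking, and the function name `floating_conditions_hold` is hypothetical.

```c
#include <assert.h>

/* Integer constant expressions: verified at compile time (C11 static_assert). */
static_assert(42, "42 is nonzero");
static_assert(1==1, "1==1 is true");
static_assert(2==2, "2==2 is true");
static_assert(4<7, "4<7 is true");
static_assert(3==3 && 4==4, "both comparisons are true");
static_assert(8-9 < 0, "-1 is less than 0");

/* The floating-point conditions are not integer constant expressions in
   strict C, so this sketch evaluates them at run time instead. */
int floating_conditions_hold(void) {
    int a = (4.3 * 3e4 >= 2 << 6);  /* roughly 129000 compared with 128 */
    int b = (-0.1 + 02) != 0;       /* 02 is octal 2, so the sum is about 1.9 */
    return a && b;
}
```

If any of the `static_assert` lines were false, the file would fail to compile, which is the same compile-time knowledge the compiler uses to drop the comparison from the loop.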
Things get a little more interesting with user-defined functions:
int x(void) {
return 1;
}
while(x()) {}
#include<math.h>
double x(void) {
return sqrt(7);
}
while(x()) {}
At -O0, these two examples actually call x and perform a comparison for each iteration.
First example (returning 1):
.L4:
call x
testl %eax, %eax
jne .L4
movl $0, %eax
addq $32, %rsp
popq %rbp
ret
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
Second example (returning sqrt(7)):
.L4:
call x
xorpd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
jp .L4
xorpd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
jne .L4
movl $0, %eax
addq $32, %rsp
popq %rbp
ret
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
However, at -O1 and above, they both produce the same assembly as the previous examples (an unconditional jmp back to the preceding label).
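As a source-level picture of what the -O0 code does (call x, test the result, repeat), here is a small sketch. The `calls` counter, the `limit` parameter, and the name `loop_with_limit` are hypothetical additions, not part of the original programs. Note that giving x a side effect changes the situation: the compiler can no longer simply discard the calls.

```c
#include <assert.h>

/* Hypothetical counter that records how often x() is invoked. */
static unsigned calls = 0;

/* Same shape as the first example, plus the counting side effect. */
int x(void) {
    calls++;
    return 1;
}

/* Runs while(x()) but breaks out after `limit` iterations (limit >= 1),
   then reports how many times x() was called. */
unsigned loop_with_limit(unsigned limit) {
    unsigned iterations = 0;
    calls = 0;
    while (x()) {
        if (++iterations == limit) break;
    }
    return calls;
}
```

Each iteration makes exactly one call to x followed by one test of its return value, which matches the call / test / jne sequence in the -O0 listing for the first example.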
TL;DR
Under GCC, the different loops are compiled to identical assembly. The compiler evaluates the constant values and doesn't bother performing any actual comparison.
The moral of the story is:
- There exists a layer of translation between C source code and CPU instructions, and this layer has important implications for performance.
- Therefore, performance cannot be evaluated by only looking at source code.
- The compiler should be smart enough to optimize such trivial cases. Programmers should not waste their time thinking about them in the vast majority of cases.