The following Java program takes on average between 0.50 and 0.55 seconds to run:
public static void main(String[] args) {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    System.out.println(
        (double) (System.nanoTime() - startTime) / 1000000000 + " s");
    System.out.println("n = " + n);
}
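If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65 seconds to run. How come?
I ran each version of the program 15 times, alternating between the two. Here are the results: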
2*(i*i) │ 2*i*i
──────────┼──────────
0.5183738 │ 0.6246434
0.5298337 │ 0.6049722
0.5308647 │ 0.6603363
0.5133458 │ 0.6243328
0.5003011 │ 0.6541802
0.5366181 │ 0.6312638
0.515149 │ 0.6241105
0.5237389 │ 0.627815
0.5249942 │ 0.6114252
0.5641624 │ 0.6781033
0.538412 │ 0.6393969
0.5466744 │ 0.6608845
0.531159 │ 0.6201077
0.5048032 │ 0.6511559
0.5232789 │ 0.6544526
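The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If the two had the same efficiency, the probability of this happening would be less than 1/2^15 * 100% = 0.00305%.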
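The percentage quoted above is plain arithmetic and can be checked in a few lines. This is only a sketch: the class and method names are made up for illustration, and it assumes, as the estimate itself does, that each of the 15 comparisons is an independent 50/50 outcome.

```java
// Hypothetical helper, not part of the original program.
class CoinFlipOdds {
    // Chance, in percent, that one fixed side wins all `runs`
    // independent 50/50 comparisons: 100 / 2^runs.
    static double percentChance(int runs) {
        return 100.0 / Math.pow(2, runs);
    }

    public static void main(String[] args) {
        // For 15 runs this is roughly 0.00305 percent.
        System.out.println(percentChance(15) + " %");
    }
}
```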
Kasperd asked in a comment on the accepted answer:
The Java and C examples use quite different register names. Are both examples using the AMD64 ISA?
xor edx, edx
xor eax, eax
.L2:
mov ecx, edx
imul ecx, edx
add edx, 1
lea eax, [rax+rcx*2]
cmp edx, 1000000000
jne .L2
I don't have enough reputation to answer this in the comments, but these are the same ISA. It's worth pointing out that the GCC version uses 32-bit integer logic, while the JVM-compiled version uses 64-bit integer logic internally.
R8 to R15 are just the new x86-64 registers. EAX to EDX are the lower halves of the RAX to RDX general-purpose registers. The important part of the answer is that the GCC version is not unrolled: it executes one round of the loop per actual machine-code loop, while the JVM version has 16 rounds of the loop in one physical loop (based on rustyx's answer; I did not reinterpret the assembly). This is one of the reasons why more registers are being used, since the loop body is actually 16 times longer.
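To make "several rounds of the loop in one physical loop" concrete, here is a rough source-level sketch of unrolling. It is only an illustration: the JIT does this on machine code, not on Java source, the factor of 4 is chosen for brevity (the answer above mentions 16), and the class and method names are hypothetical.

```java
// Hypothetical illustration of loop unrolling; not what the JIT literally emits.
class UnrollSketch {
    // Plain loop: one source iteration per pass through the loop.
    static int rolled(int limit) {
        int n = 0;
        for (int i = 0; i < limit; i++) {
            n += 2 * (i * i);
        }
        return n;
    }

    // Hand-unrolled by 4: four source iterations per pass, so the loop
    // counter and branch are handled a quarter as often.
    static int unrolledBy4(int limit) {
        int n = 0;
        int i = 0;
        for (; i + 3 < limit; i += 4) {
            n += 2 * (i * i);
            n += 2 * ((i + 1) * (i + 1));
            n += 2 * ((i + 2) * (i + 2));
            n += 2 * ((i + 3) * (i + 3));
        }
        // Remainder loop for limits that are not a multiple of 4.
        for (; i < limit; i++) {
            n += 2 * (i * i);
        }
        return n;
    }

    public static void main(String[] args) {
        int limit = 1_000_000;
        System.out.println("rolled   = " + rolled(limit));
        System.out.println("unrolled = " + unrolledBy4(limit));
    }
}
```

Both methods compute the same value; the unrolled body is longer, which is also why a compiler needs more registers for it.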
The two ways of adding generate slightly different bytecode:
17: iconst_2
18: iload 4
20: iload 4
22: imul
23: imul
24: iadd
For 2 * (i * i) vs:
17: iconst_2
18: iload 4
20: imul
21: iload 4
23: imul
24: iadd
For 2 * i * i.
When using a JMH benchmark like this:
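The two bytecode listings differ only in the order of the multiplications: without brackets, Java's left-to-right evaluation treats 2 * i * i as (2 * i) * i. A small sketch (class and method names are hypothetical) shows that both orders produce the same int, even when the intermediate products overflow, because int multiplication wraps around modulo 2^32:

```java
// Hypothetical demo class; only illustrates the two evaluation orders.
class EvaluationOrder {
    // i * i is computed first, then doubled.
    static int withBrackets(int i) {
        return 2 * (i * i);
    }

    // Evaluated left to right as (2 * i) * i.
    static int noBrackets(int i) {
        return 2 * i * i;
    }

    public static void main(String[] args) {
        // Includes values whose squares overflow a 32-bit int.
        int[] samples = {0, 3, 46341, 1_000_000, 999_999_999};
        for (int i : samples) {
            System.out.println(i + ": " + withBrackets(i) + " vs " + noBrackets(i)
                    + " equal=" + (withBrackets(i) == noBrackets(i)));
        }
    }
}
```

So the difference in run time comes from how the two bytecode sequences get compiled, not from a difference in the computed result.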
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@Warmup(iterations = 5, batchSize = 1)
@Measurement(iterations = 5, batchSize = 1)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MyBenchmark {
    @Benchmark
    public int noBrackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * i * i;
        }
        return n;
    }

    @Benchmark
    public int brackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * (i * i);
        }
        return n;
    }
}
The difference is clear:
# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: <none>
Benchmark (n) Mode Cnt Score Error Units
MyBenchmark.brackets 1000000000 avgt 5 380.889 ± 58.011 ms/op
MyBenchmark.noBrackets 1000000000 avgt 5 512.464 ± 11.098 ms/op
What you observe is correct, and not just an anomaly of your benchmarking style (e.g. no warm-up; see "How do I write a correct micro-benchmark in Java?").
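For readers unfamiliar with the warm-up issue: a hand-timed loop measures interpreted and partially compiled code unless the JIT has had a chance to compile it first. JMH takes care of this, and is the better tool. The following is only a minimal hand-rolled sketch of the idea, with hypothetical names and an arbitrary choice of five warm-up and five measured rounds.

```java
// Hypothetical sketch of manual warm-up; JMH is the more reliable approach.
class WarmupSketch {
    // Written to a volatile field so the computed value is actually used.
    static volatile int sink;

    static int sumOfDoubledSquares(int limit) {
        int n = 0;
        for (int i = 0; i < limit; i++) {
            n += 2 * (i * i);
        }
        return n;
    }

    // Times a single call and returns the elapsed time in seconds.
    static double timeSeconds(int limit) {
        long start = System.nanoTime();
        sink = sumOfDoubledSquares(limit);
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) {
        int limit = 100_000_000;
        // Warm-up rounds: results are discarded, they only give the JIT time to compile.
        for (int round = 0; round < 5; round++) {
            timeSeconds(limit);
        }
        // Measured rounds.
        for (int round = 0; round < 5; round++) {
            System.out.println(timeSeconds(limit) + " s");
        }
    }
}
```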
Running again with Graal:
# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler
Benchmark (n) Mode Cnt Score Error Units
MyBenchmark.brackets 1000000000 avgt 5 335.100 ± 23.085 ms/op
MyBenchmark.noBrackets 1000000000 avgt 5 331.163 ± 50.670 ms/op
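You can see that the results are much closer, which makes sense, since Graal is an overall better-performing, more modern compiler.
So this really just comes down to how well the JIT compiler is able to optimize a particular piece of code, and there isn't necessarily a logical reason for it.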
I got similar results:
2 * (i * i): 0.458765943 s, n=119860736
2 * i * i: 0.580255126 s, n=119860736
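I got the same results whether both loops were in the same program or each was in a separate .java file/.class, executed in a separate run.
Finally, here is a javap -c -v <.java> decompile of each: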
3: ldc #3 // String 2 * (i * i):
5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
11: lstore_1
12: iconst_0
13: istore_3
14: iconst_0
15: istore 4
17: iload 4
19: ldc #6 // int 1000000000
21: if_icmpge 40
24: iload_3
25: iconst_2
26: iload 4
28: iload 4
30: imul
31: imul
32: iadd
33: istore_3
34: iinc 4, 1
37: goto 17
vs.
3: ldc #3 // String 2 * i * i:
5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
11: lstore_1
12: iconst_0
13: istore_3
14: iconst_0
15: istore 4
17: iload 4
19: ldc #6 // int 1000000000
21: if_icmpge 40
24: iload_3
25: iconst_2
26: iload 4
28: imul
29: iload 4
31: imul
32: iadd
33: istore_3
34: iinc 4, 1
37: goto 17
FYI,
java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
More of an addendum. I reproduced the experiment using the latest Java 8 JVM from IBM:
java version "1.8.0_191"
Java(TM) 2 Runtime Environment, Standard Edition (IBM build 1.8.0_191-b12 26_Oct_2018_18_45 Mac OS X x64(SR5 FP25))
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
This shows very similar results:
0.374653912 s
n = 119860736
0.447778698 s
n = 119860736
(The second result uses 2 * i * i.)
Interestingly enough, when running on the same machine, but using Oracle Java:
Java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
the results are on average a bit slower:
0.414331815 s
n = 119860736
0.491430656 s
n = 119860736
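Long story short: even the minor version number of HotSpot matters here, as subtle differences in the JIT implementation can have notable effects.
While not directly related to the question's environment, just out of curiosity I did the same test on .NET Core 2.1, x64, release mode.
Here is the interesting result, confirming that a similar phenomenon (the other way around) happens on the dark side of the force. Code: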
static void Main(string[] args)
{
    Stopwatch watch = new Stopwatch();

    Console.WriteLine("2 * (i * i)");
    for (int a = 0; a < 10; a++)
    {
        int n = 0;
        watch.Restart();
        for (int i = 0; i < 1000000000; i++)
        {
            n += 2 * (i * i);
        }
        watch.Stop();
        Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds} ms");
    }

    Console.WriteLine();
    Console.WriteLine("2 * i * i");
    for (int a = 0; a < 10; a++)
    {
        int n = 0;
        watch.Restart();
        for (int i = 0; i < 1000000000; i++)
        {
            n += 2 * i * i;
        }
        watch.Stop();
        Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds} ms");
    }
}
Result:
2 * (i * i)
result:119860736, 438 ms
result:119860736, 433 ms
result:119860736, 437 ms
result:119860736, 435 ms
result:119860736, 436 ms
result:119860736, 435 ms
result:119860736, 435 ms
result:119860736, 439 ms
result:119860736, 436 ms
result:119860736, 437 ms
2 * i * i
result:119860736, 417 ms
result:119860736, 417 ms
result:119860736, 417 ms
result:119860736, 418 ms
result:119860736, 418 ms
result:119860736, 417 ms
result:119860736, 418 ms
result:119860736, 416 ms
result:119860736, 417 ms
result:119860736, 418 ms