下面的Java程序平均需要0.50到0.55秒的时间来运行:

public static void main(String[] args) {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    System.out.println(
        (double) (System.nanoTime() - startTime) / 1000000000 + " s");
    System.out.println("n = " + n);
}

如果我将2 * (I * I)替换为2 * I * I,它将花费0.60到0.65秒的时间运行。如何来吗?

我把程序的每个版本都运行了15次,在两者之间交替运行。以下是调查结果:

 2*(i*i)  │  2*i*i
──────────┼──────────
0.5183738 │ 0.6246434
0.5298337 │ 0.6049722
0.5308647 │ 0.6603363
0.5133458 │ 0.6243328
0.5003011 │ 0.6541802
0.5366181 │ 0.6312638
0.515149  │ 0.6241105
0.5237389 │ 0.627815
0.5249942 │ 0.6114252
0.5641624 │ 0.6781033
0.538412  │ 0.6393969
0.5466744 │ 0.6608845
0.531159  │ 0.6201077
0.5048032 │ 0.6511559
0.5232789 │ 0.6544526

2 * i * i的最快运行时间比2 * (i * i)的最慢运行时间长。如果它们具有相同的效率,发生这种情况的概率将小于1/2^15 * 100% = 0.00305%。


当前回答

(编者注:正如另一个答案所示,这个答案与来自asm的证据相矛盾。这个猜测得到了一些实验的支持,但结果并不正确。)


当乘法是2 * (i * i)时,JVM能够从循环中分解出乘法2,从而得到等效但更高效的代码:

int n = 0;
for (int i = 0; i < 1000000000; i++) {
    n += i * i;
}
n *= 2;

但是当乘法是(2 * i) * i时,JVM不会优化它,因为乘以常数不再恰好在n +=加法之前。

以下是我认为这种情况的几个原因:

在循环开始时添加if (n == 0) n = 1语句会导致两个版本的效率一样高,因为分解乘法不再保证结果相同 优化后的版本(通过分解2的乘法)与2 * (i * i)版本一样快

下面是我用来得出这些结论的测试代码:

public static void main(String[] args) {
    long fastVersion = 0;
    long slowVersion = 0;
    long optimizedVersion = 0;
    long modifiedFastVersion = 0;
    long modifiedSlowVersion = 0;

    for (int i = 0; i < 10; i++) {
        fastVersion += fastVersion();
        slowVersion += slowVersion();
        optimizedVersion += optimizedVersion();
        modifiedFastVersion += modifiedFastVersion();
        modifiedSlowVersion += modifiedSlowVersion();
    }

    System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
    System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
    System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
    System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
    System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
}

private static long fastVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    return System.nanoTime() - startTime;
}

private static long slowVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * i * i;
    }
    return System.nanoTime() - startTime;
}

private static long optimizedVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += i * i;
    }
    n *= 2;
    return System.nanoTime() - startTime;
}

private static long modifiedFastVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        if (n == 0) n = 1;
        n += 2 * (i * i);
    }
    return System.nanoTime() - startTime;
}

private static long modifiedSlowVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        if (n == 0) n = 1;
        n += 2 * i * i;
    }
    return System.nanoTime() - startTime;
}

结果如下:

Fast version: 5.7274411 s
Slow version: 7.6190804 s
Optimized version: 5.1348007 s
Modified fast version: 7.1492705 s
Modified slow version: 7.2952668 s

其他回答

字节码:https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html 字节码查看器:https://github.com/Konloch/bytecode-viewer

在我的JDK (Windows 10 64位,1.8.0_65-b17)上,我可以复制并解释:

public static void main(String[] args) {
    int repeat = 10;
    long A = 0;
    long B = 0;
    for (int i = 0; i < repeat; i++) {
        A += test();
        B += testB();
    }

    System.out.println(A / repeat + " ms");
    System.out.println(B / repeat + " ms");
}


private static long test() {
    int n = 0;
    for (int i = 0; i < 1000; i++) {
        n += multi(i);
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000000; i++) {
        n += multi(i);
    }
    long ms = (System.currentTimeMillis() - startTime);
    System.out.println(ms + " ms A " + n);
    return ms;
}


private static long testB() {
    int n = 0;
    for (int i = 0; i < 1000; i++) {
        n += multiB(i);
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000000; i++) {
        n += multiB(i);
    }
    long ms = (System.currentTimeMillis() - startTime);
    System.out.println(ms + " ms B " + n);
    return ms;
}

private static int multiB(int i) {
    return 2 * (i * i);
}

private static int multi(int i) {
    return 2 * i * i;
}

输出:

...
405 ms A 785527736
327 ms B 785527736
404 ms A 785527736
329 ms B 785527736
404 ms A 785527736
328 ms B 785527736
404 ms A 785527736
328 ms B 785527736
410 ms
333 ms

所以为什么? 字节代码如下:

 private static multiB(int arg0) { // 2 * (i * i)
     <localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>

     L1 {
         iconst_2
         iload0
         iload0
         imul
         imul
         ireturn
     }
     L2 {
     }
 }

 private static multi(int arg0) { // 2 * i * i
     <localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>

     L1 {
         iconst_2
         iload0
         imul
         iload0
         imul
         ireturn
     }
     L2 {
     }
 }

区别在于: 括号(2 * (i * i)):

Push const堆栈 将本地文件推入堆栈 将本地文件推入堆栈 将堆栈顶部相乘 将堆栈顶部相乘

不带括号(2 * i * i):

Push const堆栈 将本地文件推入堆栈 将堆栈顶部相乘 将本地文件推入堆栈 将堆栈顶部相乘

将所有内容加载到堆栈,然后再返回,这比在添加堆栈和操作堆栈之间切换要快。

我得到了类似的结果:

2 * (i * i): 0.458765943 s, n=119860736
2 * i * i: 0.580255126 s, n=119860736

如果两个循环都在同一个程序中,或者每个循环都在单独的.java文件/.class中,在单独的运行中执行,我得到了相同的结果。

最后,这里是一个javap -c -v <.java>的反编译:

     3: ldc           #3                  // String 2 * (i * i):
     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V
     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
    11: lstore_1
    12: iconst_0
    13: istore_3
    14: iconst_0
    15: istore        4
    17: iload         4
    19: ldc           #6                  // int 1000000000
    21: if_icmpge     40
    24: iload_3
    25: iconst_2
    26: iload         4
    28: iload         4
    30: imul
    31: imul
    32: iadd
    33: istore_3
    34: iinc          4, 1
    37: goto          17

vs.

     3: ldc           #3                  // String 2 * i * i:
     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V
     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
    11: lstore_1
    12: iconst_0
    13: istore_3
    14: iconst_0
    15: istore        4
    17: iload         4
    19: ldc           #6                  // int 1000000000
    21: if_icmpge     40
    24: iload_3
    25: iconst_2
    26: iload         4
    28: imul
    29: iload         4
    31: imul
    32: iadd
    33: istore_3
    34: iinc          4, 1
    37: goto          17

仅供参考,

java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Kasperd在对公认答案的评论中问道:

Java和C示例使用了完全不同的寄存器名称。这两个例子都使用AMD64 ISA?

xor edx, edx
xor eax, eax
.L2:
mov ecx, edx
imul ecx, edx
add edx, 1
lea eax, [rax+rcx*2]
cmp edx, 1000000000
jne .L2

我没有足够的声誉在评论中回答这个问题,但这些都是相同的ISA。值得指出的是,GCC版本使用32位整数逻辑,而JVM编译版本内部使用64位整数逻辑。

R8 to R15 are just new X86_64 registers. EAX to EDX are the lower parts of the RAX to RDX general purpose registers. The important part in the answer is that the GCC version is not unrolled. It simply executes one round of the loop per actual machine code loop. While the JVM version has 16 rounds of the loop in one physical loop (based on rustyx answer, I did not reinterpret the assembly). This is one of the reasons why there are more registers being used since the loop body is actually 16 times longer.

更像是一个附录。我使用IBM最新的Java 8 JVM重现了这个实验:

java version "1.8.0_191"
Java(TM) 2 Runtime Environment, Standard Edition (IBM build 1.8.0_191-b12 26_Oct_2018_18_45 Mac OS X x64(SR5 FP25))
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)

这显示了非常相似的结果:

0.374653912 s
n = 119860736
0.447778698 s
n = 119860736

(第二个结果使用2 * I * I)。

有趣的是,当在同一台机器上运行,但使用Oracle Java时:

Java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

结果平均来说有点慢:

0.414331815 s
n = 119860736
0.491430656 s
n = 119860736

长话短说:即使HotSpot的小版本号在这里也很重要,因为JIT实现中的细微差异可能会产生显著的影响。

添加的两种方法生成的字节代码略有不同:

  17: iconst_2
  18: iload         4
  20: iload         4
  22: imul
  23: imul
  24: iadd

对于2 * (i * i) vs:

  17: iconst_2
  18: iload         4
  20: imul
  21: iload         4
  23: imul
  24: iadd

对于2 * i * i。

当像这样使用JMH基准时:

@Warmup(iterations = 5, batchSize = 1)
@Measurement(iterations = 5, batchSize = 1)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MyBenchmark {

    @Benchmark
    public int noBrackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * i * i;
        }
        return n;
    }

    @Benchmark
    public int brackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * (i * i);
        }
        return n;
    }

}

区别很明显:

# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: <none>

Benchmark                      (n)  Mode  Cnt    Score    Error  Units
MyBenchmark.brackets    1000000000  avgt    5  380.889 ± 58.011  ms/op
MyBenchmark.noBrackets  1000000000  avgt    5  512.464 ± 11.098  ms/op

你观察到的是正确的,而不仅仅是你的基准测试风格的异常(例如,没有热身,参见如何用Java编写正确的微基准测试?)

再次与Graal一起运行:

# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler

Benchmark                      (n)  Mode  Cnt    Score    Error  Units
MyBenchmark.brackets    1000000000  avgt    5  335.100 ± 23.085  ms/op
MyBenchmark.noBrackets  1000000000  avgt    5  331.163 ± 50.670  ms/op

您可以看到,结果更加接近,这是有道理的,因为Graal是一个整体性能更好、更现代的编译器。

因此,这实际上只是取决于JIT编译器优化特定代码段的能力,并不一定有逻辑上的原因。