下面的Java程序平均需要0.50到0.55秒的时间来运行:

public static void main(String[] args) {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    System.out.println(
        (double) (System.nanoTime() - startTime) / 1000000000 + " s");
    System.out.println("n = " + n);
}

如果我将2 * (I * I)替换为2 * I * I,它将花费0.60到0.65秒的时间运行。如何来吗?

我把程序的每个版本都运行了15次,在两者之间交替运行。以下是调查结果:

 2*(i*i)  │  2*i*i
──────────┼──────────
0.5183738 │ 0.6246434
0.5298337 │ 0.6049722
0.5308647 │ 0.6603363
0.5133458 │ 0.6243328
0.5003011 │ 0.6541802
0.5366181 │ 0.6312638
0.515149  │ 0.6241105
0.5237389 │ 0.627815
0.5249942 │ 0.6114252
0.5641624 │ 0.6781033
0.538412  │ 0.6393969
0.5466744 │ 0.6608845
0.531159  │ 0.6201077
0.5048032 │ 0.6511559
0.5232789 │ 0.6544526

2 * i * i的最快运行时间比2 * (i * i)的最慢运行时间长。如果它们具有相同的效率,发生这种情况的概率将小于1/2^15 * 100% = 0.00305%。


当前回答

Kasperd在对公认答案的评论中问道:

Java和C示例使用了完全不同的寄存器名称。这两个例子都使用AMD64 ISA?

xor edx, edx
xor eax, eax
.L2:
mov ecx, edx
imul ecx, edx
add edx, 1
lea eax, [rax+rcx*2]
cmp edx, 1000000000
jne .L2

我没有足够的声誉在评论中回答这个问题,但这些都是相同的ISA。值得指出的是,GCC版本使用32位整数逻辑,而JVM编译版本内部使用64位整数逻辑。

R8 to R15 are just new X86_64 registers. EAX to EDX are the lower parts of the RAX to RDX general purpose registers. The important part in the answer is that the GCC version is not unrolled. It simply executes one round of the loop per actual machine code loop. While the JVM version has 16 rounds of the loop in one physical loop (based on rustyx answer, I did not reinterpret the assembly). This is one of the reasons why there are more registers being used since the loop body is actually 16 times longer.

其他回答

(编者注:正如另一个答案所示,这个答案与来自asm的证据相矛盾。这个猜测得到了一些实验的支持,但结果并不正确。)


当乘法是2 * (i * i)时,JVM能够从循环中分解出乘法2,从而得到等效但更高效的代码:

int n = 0;
for (int i = 0; i < 1000000000; i++) {
    n += i * i;
}
n *= 2;

但是当乘法是(2 * i) * i时,JVM不会优化它,因为乘以常数不再恰好在n +=加法之前。

以下是我认为这种情况的几个原因:

在循环开始时添加if (n == 0) n = 1语句会导致两个版本的效率一样高,因为分解乘法不再保证结果相同 优化后的版本(通过分解2的乘法)与2 * (i * i)版本一样快

下面是我用来得出这些结论的测试代码:

public static void main(String[] args) {
    long fastVersion = 0;
    long slowVersion = 0;
    long optimizedVersion = 0;
    long modifiedFastVersion = 0;
    long modifiedSlowVersion = 0;

    for (int i = 0; i < 10; i++) {
        fastVersion += fastVersion();
        slowVersion += slowVersion();
        optimizedVersion += optimizedVersion();
        modifiedFastVersion += modifiedFastVersion();
        modifiedSlowVersion += modifiedSlowVersion();
    }

    System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
    System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
    System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
    System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
    System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
}

private static long fastVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    return System.nanoTime() - startTime;
}

private static long slowVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * i * i;
    }
    return System.nanoTime() - startTime;
}

private static long optimizedVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += i * i;
    }
    n *= 2;
    return System.nanoTime() - startTime;
}

private static long modifiedFastVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        if (n == 0) n = 1;
        n += 2 * (i * i);
    }
    return System.nanoTime() - startTime;
}

private static long modifiedSlowVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        if (n == 0) n = 1;
        n += 2 * i * i;
    }
    return System.nanoTime() - startTime;
}

结果如下:

Fast version: 5.7274411 s
Slow version: 7.6190804 s
Optimized version: 5.1348007 s
Modified fast version: 7.1492705 s
Modified slow version: 7.2952668 s

字节码:https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html 字节码查看器:https://github.com/Konloch/bytecode-viewer

在我的JDK (Windows 10 64位,1.8.0_65-b17)上,我可以复制并解释:

public static void main(String[] args) {
    int repeat = 10;
    long A = 0;
    long B = 0;
    for (int i = 0; i < repeat; i++) {
        A += test();
        B += testB();
    }

    System.out.println(A / repeat + " ms");
    System.out.println(B / repeat + " ms");
}


private static long test() {
    int n = 0;
    for (int i = 0; i < 1000; i++) {
        n += multi(i);
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000000; i++) {
        n += multi(i);
    }
    long ms = (System.currentTimeMillis() - startTime);
    System.out.println(ms + " ms A " + n);
    return ms;
}


private static long testB() {
    int n = 0;
    for (int i = 0; i < 1000; i++) {
        n += multiB(i);
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000000; i++) {
        n += multiB(i);
    }
    long ms = (System.currentTimeMillis() - startTime);
    System.out.println(ms + " ms B " + n);
    return ms;
}

private static int multiB(int i) {
    return 2 * (i * i);
}

private static int multi(int i) {
    return 2 * i * i;
}

输出:

...
405 ms A 785527736
327 ms B 785527736
404 ms A 785527736
329 ms B 785527736
404 ms A 785527736
328 ms B 785527736
404 ms A 785527736
328 ms B 785527736
410 ms
333 ms

所以为什么? 字节代码如下:

 private static multiB(int arg0) { // 2 * (i * i)
     <localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>

     L1 {
         iconst_2
         iload0
         iload0
         imul
         imul
         ireturn
     }
     L2 {
     }
 }

 private static multi(int arg0) { // 2 * i * i
     <localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>

     L1 {
         iconst_2
         iload0
         imul
         iload0
         imul
         ireturn
     }
     L2 {
     }
 }

区别在于: 括号(2 * (i * i)):

Push const堆栈 将本地文件推入堆栈 将本地文件推入堆栈 将堆栈顶部相乘 将堆栈顶部相乘

不带括号(2 * i * i):

Push const堆栈 将本地文件推入堆栈 将堆栈顶部相乘 将本地文件推入堆栈 将堆栈顶部相乘

将所有内容加载到堆栈,然后再返回,这比在添加堆栈和操作堆栈之间切换要快。

我尝试了一个使用默认原型的JMH:我还添加了一个基于Runemoro解释的优化版本。

@State(Scope.Benchmark)
@Warmup(iterations = 2)
@Fork(1)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
//@BenchmarkMode({ Mode.All })
@BenchmarkMode(Mode.AverageTime)
public class MyBenchmark {
  @Param({ "100", "1000", "1000000000" })
  private int size;

  @Benchmark
  public int two_square_i() {
    int n = 0;
    for (int i = 0; i < size; i++) {
      n += 2 * (i * i);
    }
    return n;
  }

  @Benchmark
  public int square_i_two() {
    int n = 0;
    for (int i = 0; i < size; i++) {
      n += i * i;
    }
    return 2*n;
  }

  @Benchmark
  public int two_i_() {
    int n = 0;
    for (int i = 0; i < size; i++) {
      n += 2 * i * i;
    }
    return n;
  }
}

结果如下:

Benchmark                           (size)  Mode  Samples          Score   Score error  Units
o.s.MyBenchmark.square_i_two           100  avgt       10         58,062         1,410  ns/op
o.s.MyBenchmark.square_i_two          1000  avgt       10        547,393        12,851  ns/op
o.s.MyBenchmark.square_i_two    1000000000  avgt       10  540343681,267  16795210,324  ns/op
o.s.MyBenchmark.two_i_                 100  avgt       10         87,491         2,004  ns/op
o.s.MyBenchmark.two_i_                1000  avgt       10       1015,388        30,313  ns/op
o.s.MyBenchmark.two_i_          1000000000  avgt       10  967100076,600  24929570,556  ns/op
o.s.MyBenchmark.two_square_i           100  avgt       10         70,715         2,107  ns/op
o.s.MyBenchmark.two_square_i          1000  avgt       10        686,977        24,613  ns/op
o.s.MyBenchmark.two_square_i    1000000000  avgt       10  652736811,450  27015580,488  ns/op

在我的个人电脑上(酷睿i7 860 -除了在我的智能手机上阅读之外,它什么都没做):

N += i*i则N *2优先 2 * (i * i)是第二。

JVM的优化方式显然与人类不同(根据Runemoro的回答)。

现在,读取字节码:javap -c -v ./target/classes/org/sample/MyBenchmark.class

2*(i*i)(左)和2*i*i(右)的区别:https://www.diffchecker.com/cvSFppWI 2*(i*i)与优化版本的区别:https://www.diffchecker.com/I1XFu5dP

我不是字节码方面的专家,但我们在imul之前使用iload_2:这可能是你得到区别的地方:我可以假设JVM优化读取I两次(I已经在这里,不需要再次加载它),而在2* I * I中它不能。

Kasperd在对公认答案的评论中问道:

Java和C示例使用了完全不同的寄存器名称。这两个例子都使用AMD64 ISA?

xor edx, edx
xor eax, eax
.L2:
mov ecx, edx
imul ecx, edx
add edx, 1
lea eax, [rax+rcx*2]
cmp edx, 1000000000
jne .L2

我没有足够的声誉在评论中回答这个问题,但这些都是相同的ISA。值得指出的是,GCC版本使用32位整数逻辑,而JVM编译版本内部使用64位整数逻辑。

R8 to R15 are just new X86_64 registers. EAX to EDX are the lower parts of the RAX to RDX general purpose registers. The important part in the answer is that the GCC version is not unrolled. It simply executes one round of the loop per actual machine code loop. While the JVM version has 16 rounds of the loop in one physical loop (based on rustyx answer, I did not reinterpret the assembly). This is one of the reasons why there are more registers being used since the loop body is actually 16 times longer.

虽然与问题的环境没有直接关系,但出于好奇,我在。net Core 2.1 x64发布模式上做了同样的测试。

这是一个有趣的结果,证实了类似的现象(相反)发生在原力的黑暗面。代码:

static void Main(string[] args)
{
    Stopwatch watch = new Stopwatch();

    Console.WriteLine("2 * (i * i)");

    for (int a = 0; a < 10; a++)
    {
        int n = 0;

        watch.Restart();

        for (int i = 0; i < 1000000000; i++)
        {
            n += 2 * (i * i);
        }

        watch.Stop();

        Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds} ms");
    }

    Console.WriteLine();
    Console.WriteLine("2 * i * i");

    for (int a = 0; a < 10; a++)
    {
        int n = 0;

        watch.Restart();

        for (int i = 0; i < 1000000000; i++)
        {
            n += 2 * i * i;
        }

        watch.Stop();

        Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds}ms");
    }
}

结果:

2 * (i * i)

结果:119860736,438 ms 结果:119860736,433 ms 结果:119860736,437 ms 结果:119860736,435毫秒 结果:119860736,436 ms 结果:119860736,435毫秒 结果:119860736,435毫秒 结果:119860736,439 ms 结果:119860736,436 ms 结果:119860736,437 ms

2 * I * I

结果:119860736,417毫秒 结果:119860736,417毫秒 结果:119860736,417毫秒 结果:119860736,418 ms 结果:119860736,418 ms 结果:119860736,417毫秒 结果:119860736,418 ms 结果:119860736,416毫秒 结果:119860736,417毫秒 结果:119860736,418 ms