我正在用Java写一些代码,在某些时候,程序的流是由两个int变量“a”和“b”是否为非零决定的(注意:a和b永远不为负,并且永远不在整数溢出范围内)。

我可以用

if (a != 0 && b != 0) { /* Some code */ }

或者

if (a*b != 0) { /* Some code */ }

因为我希望这段代码每次运行数百万次,所以我想知道哪一个会更快。我通过在一个巨大的随机生成的数组中比较它们来做实验,我也很好奇数组的稀疏性(fraction of data = 0)会如何影响结果:

long time;
final int len = 50000000;
int arbitrary = 0;
int[][] nums = new int[2][len];

for (double fraction = 0 ; fraction <= 0.9 ; fraction += 0.0078125) {
    for(int i = 0 ; i < 2 ; i++) {
        for(int j = 0 ; j < len ; j++) {
            double random = Math.random();

            if(random < fraction) nums[i][j] = 0;
            else nums[i][j] = (int) (random*15 + 1);
        }
    }

    time = System.currentTimeMillis();

    for(int i = 0 ; i < len ; i++) {
        if( /*insert nums[0][i]*nums[1][i]!=0 or nums[0][i]!=0 && nums[1][i]!=0*/ ) arbitrary++;
    }
    System.out.println(System.currentTimeMillis() - time);
}

结果表明,如果你期望“a”或“b”等于0的概率超过~3%,a*b != 0比a!=0 && b!= 0:

我很想知道为什么。谁能给我一点启示?是编译器还是硬件层?

编辑:出于好奇……现在我了解了分支预测,我想知道如果a或b是非零,模拟比较会显示什么:

我们确实看到了与预期相同的分支预测效果,有趣的是,图在x轴上有些翻转。

更新

1-我添加!(a==0 || b==0)到分析中,看看会发生什么。

2-我也加入了a != 0 || b != 0, (a+b) != 0和(a|b) != 0,在学习了分支预测之后。但它们在逻辑上并不等同于其他表达式,因为只有a或b需要非零才能返回true,因此它们不用于比较处理效率。

3-我还添加了我用于分析的实际基准测试,这只是迭代一个任意int变量。

4-有些人建议包含a != 0 & b != 0,而不是a != 0 && b != 0,并预测它将更接近a*b != 0,因为我们将删除分支预测效应。我不知道&可以用于布尔变量,我以为它只用于整数的二进制操作。

注意:在我考虑所有这些的上下文中,int溢出不是一个问题,但在一般上下文中,这肯定是一个重要的考虑因素。

CPU: Intel Core i7-3610QM @ 2.3GHz

Java版本:1.8.0_45 Java(TM) SE运行时环境(build 1.8.0_45-b14) Java HotSpot(TM) 64位服务器虚拟机(build 25.45-b02,混合模式)


当前回答

这里的答案都很好,尽管我有一个想法可能会改善情况。

由于两个分支和相关的分支预测可能是罪魁祸首,我们可能能够在不改变逻辑的情况下将分支减少到单个分支。

bool aNotZero = (nums[0][i] != 0);
bool bNotZero = (nums[1][i] != 0);
if (aNotZero && bNotZero) { /* Some code */ }

这也可能是可行的

int a = nums[0][i];
int b = nums[1][i];
if (a != 0 && b != 0) { /* Some code */ }

The reason being, by the rules of short circuiting, if the first boolean is false, the second should not be evaluated. It has to perform an extra branch to avoid evaluating nums[1][i] if nums[0][i] was false. Now, you may not care that nums[1][i] gets evaluated, but the compiler can't be certain that it won't throw an out of range or null ref when you do. By reducing the if block to simple bools, the compiler may be smart enough to realize that evaluating the second boolean unnecessarily won't have negative side effects.

其他回答

我认为你的基准测试有一些缺陷,可能对推断实际程序没有用处。以下是我的想法:

(a|b)!=0 and (a+b)!=0 test if either value is non-zero, whereas a != 0 && b != 0 and (a*b)!=0 test if both are non-zero. So you are not comparing the timing of just the arithmetic: if the condition is true more often, it causes more executions of the if body, which takes more time too. (a+b)!=0 will do the wrong thing for positive and negative values that sum to zero, so you can't use it in the general case, even if it works here. Also for a=b=0x80000000 (MIN_VALUE), the only set bit will overflow out the top. Similarly, (a*b)!=0 will do the wrong thing for values that overflow. Random example: 196608 * 327680 is 0 because the true result happens to be divisible by 232, so its low 32 bits are 0, and those bits are all you get if it's an int operation. The VM will optimize the expression during the first few runs of the outer (fraction) loop, when fraction is 0, when the branches are almost never taken. The optimizer may do different things if you start fraction at 0.5. Unless the VM is able to eliminate some of the array bounds checks here, there are four other branches in the expression just due to the bounds checks, and that's a complicating factor when trying to figure out what's happening at a low level. You might get different results if you split the two-dimensional array into two flat arrays, changing nums[0][i] and nums[1][i] to nums0[i] and nums1[i]. CPU branch predictors detect short patterns in the data, or runs of all branches being taken or not taken. Your randomly generated benchmark data is the worst-case scenario for a branch predictor. If real-world data has a predictable pattern, or it has long runs of all-zero and all-non-zero values, the branches could cost much less. The particular code that is executed after the condition is met can affect the performance of evaluating the condition itself, because it affects things like whether or not the loop can be unrolled, which CPU registers are available, and if any of the fetched nums values need to be reused after evaluating the condition. Merely incrementing a counter in the benchmark is not a perfect placeholder for what real code would do. System.currentTimeMillis() is on most systems not more accurate than +/- 10 ms. System.nanoTime() is usually more accurate.

有很多不确定性,对于这些类型的微优化总是很难确定,因为在一个VM或CPU上更快的技巧可能在另一个VM或CPU上更慢。如果运行的是32位HotSpot JVM,而不是64位版本,请注意它有两种风格:与“服务器”VM相比,“客户端”VM具有不同的(较弱的)优化。

如果您可以分解VM生成的机器代码,那么就这样做,而不是试图猜测它是做什么的!

我知道这个问题很老了,但出于好奇和训练,我用JMH做了一些测试。结果略有不同:

位或运算(a | b != 0)和乘法运算(a * b != 0)是最快的; 正常的和(a!=0 & b!=0)几乎和乘法一样好 短路AND和OR (a!=0 && b!= 0, ! !=0 || b!=0)最慢

免责声明:我甚至不是JMH的专家。

在这里的代码,我试图重现有问题的代码,添加了位或:

@Warmup(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Fork(value = 3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MultAnd {
    
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
            .include(MultAnd.class.getSimpleName())
            .build();
        
        new Runner(opt).run();
    }

    private static final int size = 50_000_000;
    
    @Param({"0.00", "0.10", "0.20", "0.30", "0.40", "0.45", 
            "0.50", "0.55", "0.60", "0.70", "0.80", "0.90", 
            "1.00"})
    private double fraction;

    private int[][] nums;

    @Setup
    public void setup() {
        nums = new int[2][size];
        for(int i = 0 ; i < 2 ; i++) {
            for(int j = 0 ; j < size ; j++) {
                double random = Math.random();
                if (random < fraction) 
                    nums[i][j] = 0;
                else 
                    nums[i][j] = (int) (random*15 + 1);
            }
        }
    }
    
    @Benchmark
    public int and() {
        int s = 0;
        for (int i = 0; i < size; i++) {
            if ((nums[0][i]!=0) & (nums[1][i]!=0))
                s++;
        }
        return s;
    }
    
    @Benchmark
    public int andAnd() {
        int s = 0;
        for (int i = 0; i < size; i++) {
            if ((nums[0][i]!=0) && (nums[1][i]!=0))
                s++;
        }
        return s;
    }
    
    @Benchmark
    public int bitOr() {
        int s = 0;
        for (int i = 0; i < size; i++) {
            if ((nums[0][i] | nums[1][i]) != 0)
                s++;
        }
        return s;
    }
    
    @Benchmark
    public int mult() {
        int s = 0;
        for (int i = 0; i < size; i++) {
            if (nums[0][i]*nums[1][i] != 0)
                s++;
        }
        return s;
    }
    
    @Benchmark
    public int notOrOr() {
        int s = 0;
        for (int i = 0; i < size; i++) {
            if (!((nums[0][i]!=0) || (nums[1][i]!=0)))
                s++;
        }
        return s;
    }
}

结果是:

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark        (fraction)  Mode  Cnt    Score    Error  Units
MultAnd.and            0.00  avgt   30   33.238 ±  0.883  ms/op
MultAnd.and            0.10  avgt   30   48.011 ±  1.635  ms/op
MultAnd.and            0.20  avgt   30   48.284 ±  1.797  ms/op
MultAnd.and            0.30  avgt   30   47.969 ±  1.463  ms/op
MultAnd.and            0.40  avgt   30   48.999 ±  2.881  ms/op
MultAnd.and            0.45  avgt   30   47.804 ±  1.053  ms/op
MultAnd.and            0.50  avgt   30   48.332 ±  1.990  ms/op
MultAnd.and            0.55  avgt   30   47.457 ±  1.210  ms/op
MultAnd.and            0.60  avgt   30  127.530 ±  3.104  ms/op
MultAnd.and            0.70  avgt   30   92.630 ±  1.471  ms/op
MultAnd.and            0.80  avgt   30   69.458 ±  1.609  ms/op
MultAnd.and            0.90  avgt   30   55.421 ±  1.443  ms/op
MultAnd.and            1.00  avgt   30   30.672 ±  1.387  ms/op
MultAnd.andAnd         0.00  avgt   30   33.187 ±  0.978  ms/op
MultAnd.andAnd         0.10  avgt   30  110.943 ±  1.435  ms/op
MultAnd.andAnd         0.20  avgt   30  177.527 ±  2.353  ms/op
MultAnd.andAnd         0.30  avgt   30  226.247 ±  1.879  ms/op
MultAnd.andAnd         0.40  avgt   30  266.316 ± 18.647  ms/op
MultAnd.andAnd         0.45  avgt   30  258.120 ±  2.638  ms/op
MultAnd.andAnd         0.50  avgt   30  259.727 ±  3.532  ms/op
MultAnd.andAnd         0.55  avgt   30  248.706 ±  1.419  ms/op
MultAnd.andAnd         0.60  avgt   30  229.825 ±  1.256  ms/op
MultAnd.andAnd         0.70  avgt   30  177.911 ±  2.787  ms/op
MultAnd.andAnd         0.80  avgt   30  121.303 ±  2.724  ms/op
MultAnd.andAnd         0.90  avgt   30   67.914 ±  1.589  ms/op
MultAnd.andAnd         1.00  avgt   30   15.892 ±  0.349  ms/op
MultAnd.bitOr          0.00  avgt   30   33.271 ±  1.896  ms/op
MultAnd.bitOr          0.10  avgt   30   35.597 ±  1.536  ms/op
MultAnd.bitOr          0.20  avgt   30   42.366 ±  1.641  ms/op
MultAnd.bitOr          0.30  avgt   30   58.385 ±  0.778  ms/op
MultAnd.bitOr          0.40  avgt   30   85.567 ±  1.678  ms/op
MultAnd.bitOr          0.45  avgt   30   32.152 ±  1.345  ms/op
MultAnd.bitOr          0.50  avgt   30   32.190 ±  1.357  ms/op
MultAnd.bitOr          0.55  avgt   30   32.335 ±  1.384  ms/op
MultAnd.bitOr          0.60  avgt   30   31.910 ±  1.026  ms/op
MultAnd.bitOr          0.70  avgt   30   31.783 ±  0.807  ms/op
MultAnd.bitOr          0.80  avgt   30   31.671 ±  0.745  ms/op
MultAnd.bitOr          0.90  avgt   30   31.329 ±  0.708  ms/op
MultAnd.bitOr          1.00  avgt   30   30.530 ±  0.534  ms/op
MultAnd.mult           0.00  avgt   30   30.859 ±  0.735  ms/op
MultAnd.mult           0.10  avgt   30   33.933 ±  0.887  ms/op
MultAnd.mult           0.20  avgt   30   34.243 ±  0.917  ms/op
MultAnd.mult           0.30  avgt   30   33.825 ±  1.155  ms/op
MultAnd.mult           0.40  avgt   30   34.309 ±  1.334  ms/op
MultAnd.mult           0.45  avgt   30   33.709 ±  1.277  ms/op
MultAnd.mult           0.50  avgt   30   37.699 ±  4.029  ms/op
MultAnd.mult           0.55  avgt   30   36.374 ±  2.948  ms/op
MultAnd.mult           0.60  avgt   30  100.354 ±  1.553  ms/op
MultAnd.mult           0.70  avgt   30   69.570 ±  1.441  ms/op
MultAnd.mult           0.80  avgt   30   47.181 ±  1.567  ms/op
MultAnd.mult           0.90  avgt   30   33.552 ±  0.999  ms/op
MultAnd.mult           1.00  avgt   30   30.775 ±  0.548  ms/op
MultAnd.notOrOr        0.00  avgt   30   15.701 ±  0.254  ms/op
MultAnd.notOrOr        0.10  avgt   30   68.052 ±  1.433  ms/op
MultAnd.notOrOr        0.20  avgt   30  120.393 ±  2.299  ms/op
MultAnd.notOrOr        0.30  avgt   30  177.729 ±  2.438  ms/op
MultAnd.notOrOr        0.40  avgt   30  229.547 ±  1.859  ms/op
MultAnd.notOrOr        0.45  avgt   30  250.660 ±  4.810  ms/op
MultAnd.notOrOr        0.50  avgt   30  258.760 ±  2.190  ms/op
MultAnd.notOrOr        0.55  avgt   30  258.010 ±  1.018  ms/op
MultAnd.notOrOr        0.60  avgt   30  254.732 ±  2.076  ms/op
MultAnd.notOrOr        0.70  avgt   30  227.148 ±  2.040  ms/op
MultAnd.notOrOr        0.80  avgt   30  180.193 ±  4.659  ms/op
MultAnd.notOrOr        0.90  avgt   30  112.212 ±  3.111  ms/op
MultAnd.notOrOr        1.00  avgt   30   33.458 ±  0.952  ms/op

或者用图表表示:

这里的答案都很好,尽管我有一个想法可能会改善情况。

由于两个分支和相关的分支预测可能是罪魁祸首,我们可能能够在不改变逻辑的情况下将分支减少到单个分支。

bool aNotZero = (nums[0][i] != 0);
bool bNotZero = (nums[1][i] != 0);
if (aNotZero && bNotZero) { /* Some code */ }

这也可能是可行的

int a = nums[0][i];
int b = nums[1][i];
if (a != 0 && b != 0) { /* Some code */ }

The reason being, by the rules of short circuiting, if the first boolean is false, the second should not be evaluated. It has to perform an extra branch to avoid evaluating nums[1][i] if nums[0][i] was false. Now, you may not care that nums[1][i] gets evaluated, but the compiler can't be certain that it won't throw an out of range or null ref when you do. By reducing the if block to simple bools, the compiler may be smart enough to realize that evaluating the second boolean unnecessarily won't have negative side effects.

我忽略了你的基准测试可能有缺陷的问题,只看表面结果。

是编译器还是硬件层?

对于后者,我认为:

  if (a != 0 && b != 0)

将编译到2个内存负载和两个条件分支

  if (a * b != 0)

将编译到2个内存负载,一个乘法和一个条件分支。

如果硬件级分支预测无效,乘法可能比第二个条件分支快。随着比例的增加……分支预测变得不那么有效。

条件分支较慢的原因是它们会导致指令执行管道暂停。分支预测是通过预测分支的走向,并在此基础上推测地选择下一条指令来避免失速。如果预测失败,则在加载另一个方向的指令时有延迟。

(注:以上解释过于简化。要获得更准确的解释,您需要查看CPU制造商为汇编语言编码器和编译器编写者提供的文献。维基百科上关于分支预测器的页面是很好的背景。)


然而,在进行此优化时,有一件事需要注意。是否存在a * b != 0会给出错误答案的值?考虑计算乘积导致整数溢出的情况。


更新

你的图表似乎证实了我说的话。

在条件分支a * b != 0的情况下,还有一个“分支预测”效果,这在图中体现出来。 如果你在X轴上投影超过0.9的曲线,看起来像1)它们将在大约1.0处相遇,2)相遇点将与X = 0.0时大致相同的Y值。


更新2

我不明白为什么a + b != 0和a | b != 0的曲线是不同的。分支预测器逻辑中可能有一些聪明的东西。也有可能是别的意思。

(请注意,这种事情可以特定于特定的芯片型号甚至版本。在其他系统上,基准测试的结果可能不同。)

然而,它们都有适用于所有a和b的非负值的优点。

您正在使用随机输入数据,这使得分支不可预测。在实践中,分支通常(~90%)是可预测的,因此在实际代码中,分支代码可能更快。

也就是说。我不明白a*b != 0怎么会比(a|b) != 0快。一般来说,整数乘法比位或运算代价更大。但像这样的事情偶尔会变得很奇怪。参见处理器缓存效果图库中的“示例7:硬件复杂性”示例。