我正在用Java写一些代码,在某些时候,程序的流是由两个int变量“a”和“b”是否为非零决定的(注意:a和b永远不为负,并且永远不在整数溢出范围内)。
我可以用
if (a != 0 && b != 0) { /* Some code */ }
或者
if (a*b != 0) { /* Some code */ }
因为我希望这段代码每次运行数百万次,所以我想知道哪一个会更快。我通过在一个巨大的随机生成的数组中比较它们来做实验,我也很好奇数组的稀疏性(fraction of data = 0)会如何影响结果:
long time;
final int len = 50000000;
int arbitrary = 0;
int[][] nums = new int[2][len];
for (double fraction = 0 ; fraction <= 0.9 ; fraction += 0.0078125) {
for(int i = 0 ; i < 2 ; i++) {
for(int j = 0 ; j < len ; j++) {
double random = Math.random();
if(random < fraction) nums[i][j] = 0;
else nums[i][j] = (int) (random*15 + 1);
}
}
time = System.currentTimeMillis();
for(int i = 0 ; i < len ; i++) {
if( /*insert nums[0][i]*nums[1][i]!=0 or nums[0][i]!=0 && nums[1][i]!=0*/ ) arbitrary++;
}
System.out.println(System.currentTimeMillis() - time);
}
结果表明,如果你期望“a”或“b”等于0的概率超过~3%,a*b != 0比a!=0 && b!= 0:
我很想知道为什么。谁能给我一点启示?是编译器还是硬件层?
编辑:出于好奇……现在我了解了分支预测,我想知道如果a或b是非零,模拟比较会显示什么:
我们确实看到了与预期相同的分支预测效果,有趣的是,图在x轴上有些翻转。
更新
1-我添加!(a==0 || b==0)到分析中,看看会发生什么。
2-我也加入了a != 0 || b != 0, (a+b) != 0和(a|b) != 0,在学习了分支预测之后。但它们在逻辑上并不等同于其他表达式,因为只有a或b需要非零才能返回true,因此它们不用于比较处理效率。
3-我还添加了我用于分析的实际基准测试,这只是迭代一个任意int变量。
4-有些人建议包含a != 0 & b != 0,而不是a != 0 && b != 0,并预测它将更接近a*b != 0,因为我们将删除分支预测效应。我不知道&可以用于布尔变量,我以为它只用于整数的二进制操作。
注意:在我考虑所有这些的上下文中,int溢出不是一个问题,但在一般上下文中,这肯定是一个重要的考虑因素。
CPU: Intel Core i7-3610QM @ 2.3GHz
Java版本:1.8.0_45
Java(TM) SE运行时环境(build 1.8.0_45-b14)
Java HotSpot(TM) 64位服务器虚拟机(build 25.45-b02,混合模式)
我忽略了你的基准测试可能有缺陷的问题,只看表面结果。
是编译器还是硬件层?
对于后者,我认为:
if (a != 0 && b != 0)
将编译到2个内存负载和两个条件分支
if (a * b != 0)
将编译到2个内存负载,一个乘法和一个条件分支。
如果硬件级分支预测无效,乘法可能比第二个条件分支快。随着比例的增加……分支预测变得不那么有效。
条件分支较慢的原因是它们会导致指令执行管道暂停。分支预测是通过预测分支的走向,并在此基础上推测地选择下一条指令来避免失速。如果预测失败,则在加载另一个方向的指令时有延迟。
(注:以上解释过于简化。要获得更准确的解释,您需要查看CPU制造商为汇编语言编码器和编译器编写者提供的文献。维基百科上关于分支预测器的页面是很好的背景。)
然而,在进行此优化时,有一件事需要注意。是否存在a * b != 0会给出错误答案的值?考虑计算乘积导致整数溢出的情况。
更新
你的图表似乎证实了我说的话。
在条件分支a * b != 0的情况下,还有一个“分支预测”效果,这在图中体现出来。
如果你在X轴上投影超过0.9的曲线,看起来像1)它们将在大约1.0处相遇,2)相遇点将与X = 0.0时大致相同的Y值。
更新2
我不明白为什么a + b != 0和a | b != 0的曲线是不同的。分支预测器逻辑中可能有一些聪明的东西。也有可能是别的意思。
(请注意,这种事情可以特定于特定的芯片型号甚至版本。在其他系统上,基准测试的结果可能不同。)
然而,它们都有适用于所有a和b的非负值的优点。
我认为你的基准测试有一些缺陷,可能对推断实际程序没有用处。以下是我的想法:
(a|b)!=0 and (a+b)!=0 test if either value is non-zero, whereas a != 0 && b != 0 and (a*b)!=0 test if both are non-zero. So you are not comparing the timing of just the arithmetic: if the condition is true more often, it causes more executions of the if body, which takes more time too.
(a+b)!=0 will do the wrong thing for positive and negative values that sum to zero, so you can't use it in the general case, even if it works here. Also for a=b=0x80000000 (MIN_VALUE), the only set bit will overflow out the top.
Similarly, (a*b)!=0 will do the wrong thing for values that overflow. Random example: 196608 * 327680 is 0 because the true result happens to be divisible by 232, so its low 32 bits are 0, and those bits are all you get if it's an int operation.
The VM will optimize the expression during the first few runs of the outer (fraction) loop, when fraction is 0, when the branches are almost never taken. The optimizer may do different things if you start fraction at 0.5.
Unless the VM is able to eliminate some of the array bounds checks here, there are four other branches in the expression just due to the bounds checks, and that's a complicating factor when trying to figure out what's happening at a low level. You might get different results if you split the two-dimensional array into two flat arrays, changing nums[0][i] and nums[1][i] to nums0[i] and nums1[i].
CPU branch predictors detect short patterns in the data, or runs of all branches being taken or not taken. Your randomly generated benchmark data is the worst-case scenario for a branch predictor. If real-world data has a predictable pattern, or it has long runs of all-zero and all-non-zero values, the branches could cost much less.
The particular code that is executed after the condition is met can affect the performance of evaluating the condition itself, because it affects things like whether or not the loop can be unrolled, which CPU registers are available, and if any of the fetched nums values need to be reused after evaluating the condition. Merely incrementing a counter in the benchmark is not a perfect placeholder for what real code would do.
System.currentTimeMillis() is on most systems not more accurate than +/- 10 ms. System.nanoTime() is usually more accurate.
有很多不确定性,对于这些类型的微优化总是很难确定,因为在一个VM或CPU上更快的技巧可能在另一个VM或CPU上更慢。如果运行的是32位HotSpot JVM,而不是64位版本,请注意它有两种风格:与“服务器”VM相比,“客户端”VM具有不同的(较弱的)优化。
如果您可以分解VM生成的机器代码,那么就这样做,而不是试图猜测它是做什么的!
这里的答案都很好,尽管我有一个想法可能会改善情况。
由于两个分支和相关的分支预测可能是罪魁祸首,我们可能能够在不改变逻辑的情况下将分支减少到单个分支。
bool aNotZero = (nums[0][i] != 0);
bool bNotZero = (nums[1][i] != 0);
if (aNotZero && bNotZero) { /* Some code */ }
这也可能是可行的
int a = nums[0][i];
int b = nums[1][i];
if (a != 0 && b != 0) { /* Some code */ }
The reason being, by the rules of short circuiting, if the first boolean is false, the second should not be evaluated. It has to perform an extra branch to avoid evaluating nums[1][i] if nums[0][i] was false. Now, you may not care that nums[1][i] gets evaluated, but the compiler can't be certain that it won't throw an out of range or null ref when you do. By reducing the if block to simple bools, the compiler may be smart enough to realize that evaluating the second boolean unnecessarily won't have negative side effects.
我知道这个问题很老了,但出于好奇和训练,我用JMH做了一些测试。结果略有不同:
位或运算(a | b != 0)和乘法运算(a * b != 0)是最快的;
正常的和(a!=0 & b!=0)几乎和乘法一样好
短路AND和OR (a!=0 && b!= 0, ! !=0 || b!=0)最慢
免责声明:我甚至不是JMH的专家。
在这里的代码,我试图重现有问题的代码,添加了位或:
@Warmup(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Fork(value = 3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MultAnd {
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(MultAnd.class.getSimpleName())
.build();
new Runner(opt).run();
}
private static final int size = 50_000_000;
@Param({"0.00", "0.10", "0.20", "0.30", "0.40", "0.45",
"0.50", "0.55", "0.60", "0.70", "0.80", "0.90",
"1.00"})
private double fraction;
private int[][] nums;
@Setup
public void setup() {
nums = new int[2][size];
for(int i = 0 ; i < 2 ; i++) {
for(int j = 0 ; j < size ; j++) {
double random = Math.random();
if (random < fraction)
nums[i][j] = 0;
else
nums[i][j] = (int) (random*15 + 1);
}
}
}
@Benchmark
public int and() {
int s = 0;
for (int i = 0; i < size; i++) {
if ((nums[0][i]!=0) & (nums[1][i]!=0))
s++;
}
return s;
}
@Benchmark
public int andAnd() {
int s = 0;
for (int i = 0; i < size; i++) {
if ((nums[0][i]!=0) && (nums[1][i]!=0))
s++;
}
return s;
}
@Benchmark
public int bitOr() {
int s = 0;
for (int i = 0; i < size; i++) {
if ((nums[0][i] | nums[1][i]) != 0)
s++;
}
return s;
}
@Benchmark
public int mult() {
int s = 0;
for (int i = 0; i < size; i++) {
if (nums[0][i]*nums[1][i] != 0)
s++;
}
return s;
}
@Benchmark
public int notOrOr() {
int s = 0;
for (int i = 0; i < size; i++) {
if (!((nums[0][i]!=0) || (nums[1][i]!=0)))
s++;
}
return s;
}
}
结果是:
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark (fraction) Mode Cnt Score Error Units
MultAnd.and 0.00 avgt 30 33.238 ± 0.883 ms/op
MultAnd.and 0.10 avgt 30 48.011 ± 1.635 ms/op
MultAnd.and 0.20 avgt 30 48.284 ± 1.797 ms/op
MultAnd.and 0.30 avgt 30 47.969 ± 1.463 ms/op
MultAnd.and 0.40 avgt 30 48.999 ± 2.881 ms/op
MultAnd.and 0.45 avgt 30 47.804 ± 1.053 ms/op
MultAnd.and 0.50 avgt 30 48.332 ± 1.990 ms/op
MultAnd.and 0.55 avgt 30 47.457 ± 1.210 ms/op
MultAnd.and 0.60 avgt 30 127.530 ± 3.104 ms/op
MultAnd.and 0.70 avgt 30 92.630 ± 1.471 ms/op
MultAnd.and 0.80 avgt 30 69.458 ± 1.609 ms/op
MultAnd.and 0.90 avgt 30 55.421 ± 1.443 ms/op
MultAnd.and 1.00 avgt 30 30.672 ± 1.387 ms/op
MultAnd.andAnd 0.00 avgt 30 33.187 ± 0.978 ms/op
MultAnd.andAnd 0.10 avgt 30 110.943 ± 1.435 ms/op
MultAnd.andAnd 0.20 avgt 30 177.527 ± 2.353 ms/op
MultAnd.andAnd 0.30 avgt 30 226.247 ± 1.879 ms/op
MultAnd.andAnd 0.40 avgt 30 266.316 ± 18.647 ms/op
MultAnd.andAnd 0.45 avgt 30 258.120 ± 2.638 ms/op
MultAnd.andAnd 0.50 avgt 30 259.727 ± 3.532 ms/op
MultAnd.andAnd 0.55 avgt 30 248.706 ± 1.419 ms/op
MultAnd.andAnd 0.60 avgt 30 229.825 ± 1.256 ms/op
MultAnd.andAnd 0.70 avgt 30 177.911 ± 2.787 ms/op
MultAnd.andAnd 0.80 avgt 30 121.303 ± 2.724 ms/op
MultAnd.andAnd 0.90 avgt 30 67.914 ± 1.589 ms/op
MultAnd.andAnd 1.00 avgt 30 15.892 ± 0.349 ms/op
MultAnd.bitOr 0.00 avgt 30 33.271 ± 1.896 ms/op
MultAnd.bitOr 0.10 avgt 30 35.597 ± 1.536 ms/op
MultAnd.bitOr 0.20 avgt 30 42.366 ± 1.641 ms/op
MultAnd.bitOr 0.30 avgt 30 58.385 ± 0.778 ms/op
MultAnd.bitOr 0.40 avgt 30 85.567 ± 1.678 ms/op
MultAnd.bitOr 0.45 avgt 30 32.152 ± 1.345 ms/op
MultAnd.bitOr 0.50 avgt 30 32.190 ± 1.357 ms/op
MultAnd.bitOr 0.55 avgt 30 32.335 ± 1.384 ms/op
MultAnd.bitOr 0.60 avgt 30 31.910 ± 1.026 ms/op
MultAnd.bitOr 0.70 avgt 30 31.783 ± 0.807 ms/op
MultAnd.bitOr 0.80 avgt 30 31.671 ± 0.745 ms/op
MultAnd.bitOr 0.90 avgt 30 31.329 ± 0.708 ms/op
MultAnd.bitOr 1.00 avgt 30 30.530 ± 0.534 ms/op
MultAnd.mult 0.00 avgt 30 30.859 ± 0.735 ms/op
MultAnd.mult 0.10 avgt 30 33.933 ± 0.887 ms/op
MultAnd.mult 0.20 avgt 30 34.243 ± 0.917 ms/op
MultAnd.mult 0.30 avgt 30 33.825 ± 1.155 ms/op
MultAnd.mult 0.40 avgt 30 34.309 ± 1.334 ms/op
MultAnd.mult 0.45 avgt 30 33.709 ± 1.277 ms/op
MultAnd.mult 0.50 avgt 30 37.699 ± 4.029 ms/op
MultAnd.mult 0.55 avgt 30 36.374 ± 2.948 ms/op
MultAnd.mult 0.60 avgt 30 100.354 ± 1.553 ms/op
MultAnd.mult 0.70 avgt 30 69.570 ± 1.441 ms/op
MultAnd.mult 0.80 avgt 30 47.181 ± 1.567 ms/op
MultAnd.mult 0.90 avgt 30 33.552 ± 0.999 ms/op
MultAnd.mult 1.00 avgt 30 30.775 ± 0.548 ms/op
MultAnd.notOrOr 0.00 avgt 30 15.701 ± 0.254 ms/op
MultAnd.notOrOr 0.10 avgt 30 68.052 ± 1.433 ms/op
MultAnd.notOrOr 0.20 avgt 30 120.393 ± 2.299 ms/op
MultAnd.notOrOr 0.30 avgt 30 177.729 ± 2.438 ms/op
MultAnd.notOrOr 0.40 avgt 30 229.547 ± 1.859 ms/op
MultAnd.notOrOr 0.45 avgt 30 250.660 ± 4.810 ms/op
MultAnd.notOrOr 0.50 avgt 30 258.760 ± 2.190 ms/op
MultAnd.notOrOr 0.55 avgt 30 258.010 ± 1.018 ms/op
MultAnd.notOrOr 0.60 avgt 30 254.732 ± 2.076 ms/op
MultAnd.notOrOr 0.70 avgt 30 227.148 ± 2.040 ms/op
MultAnd.notOrOr 0.80 avgt 30 180.193 ± 4.659 ms/op
MultAnd.notOrOr 0.90 avgt 30 112.212 ± 3.111 ms/op
MultAnd.notOrOr 1.00 avgt 30 33.458 ± 0.952 ms/op
或者用图表表示: