例如,乘法和除法可以使用位运算符来实现
i*2 = i<<1
i*3 = (i<<1) + i;
i*10 = (i<<3) + (i<<1)
等等。
用say (i<<3)+(i<<1)乘10是否比直接用i*10更快?有没有什么输入是不能用这种方法乘或除的?
例如,乘法和除法可以使用位运算符来实现
i*2 = i<<1
i*3 = (i<<1) + i;
i*10 = (i<<3) + (i<<1)
等等。
用say (i<<3)+(i<<1)乘10是否比直接用i*10更快?有没有什么输入是不能用这种方法乘或除的?
当前回答
Python测试对相同的随机数执行相同的乘法1亿次。
>>> from timeit import timeit
>>> setup_str = 'import scipy; from scipy import random; scipy.random.seed(0)'
>>> N = 10*1000*1000
>>> timeit('x=random.randint(65536);', setup=setup_str, number=N)
1.894096851348877 # Time from generating the random #s and no opperati
>>> timeit('x=random.randint(65536); x*2', setup=setup_str, number=N)
2.2799630165100098
>>> timeit('x=random.randint(65536); x << 1', setup=setup_str, number=N)
2.2616429328918457
>>> timeit('x=random.randint(65536); x*10', setup=setup_str, number=N)
2.2799630165100098
>>> timeit('x=random.randint(65536); (x << 3) + (x<<1)', setup=setup_str, number=N)
2.9485139846801758
>>> timeit('x=random.randint(65536); x // 2', setup=setup_str, number=N)
2.490908145904541
>>> timeit('x=random.randint(65536); x / 2', setup=setup_str, number=N)
2.4757170677185059
>>> timeit('x=random.randint(65536); x >> 1', setup=setup_str, number=N)
2.2316000461578369
因此,在python中做移位而不是用2的幂来做乘法/除法,会有轻微的改进(~10%用于除法;~1%的乘法)。如果它不是2的幂,可能会有相当大的放缓。
同样,这些#将根据你的处理器、编译器(或解释器——为了简单起见,在python中这样做)而改变。
和其他人一样,不要过早地优化。编写可读性非常强的代码,如果不够快就进行分析,然后尝试优化慢的部分。请记住,编译器在优化方面比您做得更好。
其他回答
有些优化编译器无法做到,因为它们只适用于减少的输入集。
下面是c++示例代码,可以执行更快的除法,执行64位“乘倒数”。分子和分母都必须低于某个阈值。注意,它必须被编译为使用64位指令才能比普通除法更快。
#include <stdio.h>
#include <chrono>
static const unsigned s_bc = 32;
static const unsigned long long s_p = 1ULL << s_bc;
static const unsigned long long s_hp = s_p / 2;
static unsigned long long s_f;
static unsigned long long s_fr;
static void fastDivInitialize(const unsigned d)
{
s_f = s_p / d;
s_fr = s_f * (s_p - (s_f * d));
}
static unsigned fastDiv(const unsigned n)
{
return (s_f * n + ((s_fr * n + s_hp) >> s_bc)) >> s_bc;
}
static bool fastDivCheck(const unsigned n, const unsigned d)
{
// 32 to 64 cycles latency on modern cpus
const unsigned expected = n / d;
// At least 10 cycles latency on modern cpus
const unsigned result = fastDiv(n);
if (result != expected)
{
printf("Failed for: %u/%u != %u\n", n, d, expected);
return false;
}
return true;
}
int main()
{
unsigned result = 0;
// Make sure to verify it works for your expected set of inputs
const unsigned MAX_N = 65535;
const unsigned MAX_D = 40000;
const double ONE_SECOND_COUNT = 1000000000.0;
auto t0 = std::chrono::steady_clock::now();
unsigned count = 0;
printf("Verifying...\n");
for (unsigned d = 1; d <= MAX_D; ++d)
{
fastDivInitialize(d);
for (unsigned n = 0; n <= MAX_N; ++n)
{
count += !fastDivCheck(n, d);
}
}
auto t1 = std::chrono::steady_clock::now();
printf("Errors: %u / %u (%.4fs)\n", count, MAX_D * (MAX_N + 1), (t1 - t0).count() / ONE_SECOND_COUNT);
t0 = t1;
for (unsigned d = 1; d <= MAX_D; ++d)
{
fastDivInitialize(d);
for (unsigned n = 0; n <= MAX_N; ++n)
{
result += fastDiv(n);
}
}
t1 = std::chrono::steady_clock::now();
printf("Fast division time: %.4fs\n", (t1 - t0).count() / ONE_SECOND_COUNT);
t0 = t1;
count = 0;
for (unsigned d = 1; d <= MAX_D; ++d)
{
for (unsigned n = 0; n <= MAX_N; ++n)
{
result += n / d;
}
}
t1 = std::chrono::steady_clock::now();
printf("Normal division time: %.4fs\n", (t1 - t0).count() / ONE_SECOND_COUNT);
getchar();
return result;
}
这取决于处理器和编译器。一些编译器已经通过这种方式优化代码了,其他的还没有。 因此,每次需要以这种方式优化代码时,您都需要检查。
除非您迫切需要优化,否则我不会为了节省汇编指令或处理器周期而打乱源代码。
In the case of signed integers and right shift vs division, it can make a difference. For negative numbers, the shift rounds rounds towards negative infinity whereas division rounds towards zero. Of course the compiler will change the division to something cheaper, but it will usually change it to something that has the same rounding behavior as division, because it is either unable to prove that the variable won't be negative or it simply doesn't care. So if you can prove that a number won't be negative or if you don't care which way it will round, you can do that optimization in a way that is more likely to make a difference.
简单回答:不太可能。
长一点的回答: 你的编译器有一个优化器,它知道如何像你的目标处理器体系结构一样快速地进行乘法运算。最好的办法是清楚地告诉编译器你的意图(即i*2而不是i << 1),让它决定最快的汇编/机器码序列是什么。甚至有可能处理器本身已经将乘法指令实现为微码中的移位和加法序列。
总之,不要花太多时间担心这个。如果你想换,那就换。如果你想乘,那就乘。做语义上最清楚的事情——你的同事以后会感谢你的。或者,更有可能的是,如果你不这样做,之后会诅咒你。
刚刚在我的机器上编译了这个:
int a = ...;
int b = a * 10;
当分解它时会产生输出:
MOV EAX,DWORD PTR SS:[ESP+1C] ; Move a into EAX
LEA EAX,DWORD PTR DS:[EAX+EAX*4] ; Multiply by 5 without shift !
SHL EAX, 1 ; Multiply by 2 using shift
这个版本比纯移位和加法的手工优化代码更快。
你永远不知道编译器会得到什么,所以最好只是简单地写一个普通的乘法,让它按自己想要的方式优化,除非在非常精确的情况下,你知道编译器无法优化。