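I took Problem #12 from Project Euler as a programming exercise and compared my (surely not optimal) implementations in C, Python, Erlang and Haskell. To get longer execution times, I search for the first triangle number with more than 1000 divisors instead of the 500 stated in the original problem.
The results are as follows: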
C:
lorenzo@enzo:~/erlang$ gcc -lm -o euler12.bin euler12.c
lorenzo@enzo:~/erlang$ time ./euler12.bin
842161320
real 0m11.074s
user 0m11.070s
sys 0m0.000s
Python:
lorenzo@enzo:~/erlang$ time ./euler12.py
842161320
real 1m16.632s
user 1m16.370s
sys 0m0.250s
Python with PyPy:
lorenzo@enzo:~/Downloads/pypy-c-jit-43780-b590cf6de419-linux64/bin$ time ./pypy /home/lorenzo/erlang/euler12.py
842161320
real 0m13.082s
user 0m13.050s
sys 0m0.020s
Erlang:
lorenzo@enzo:~/erlang$ erlc euler12.erl
lorenzo@enzo:~/erlang$ time erl -s euler12 solve
Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.7.4 (abort with ^G)
1> 842161320
real 0m48.259s
user 0m48.070s
sys 0m0.020s
Haskell:
lorenzo@enzo:~/erlang$ ghc euler12.hs -o euler12.hsx
[1 of 1] Compiling Main ( euler12.hs, euler12.o )
Linking euler12.hsx ...
lorenzo@enzo:~/erlang$ time ./euler12.hsx
842161320
real 2m37.326s
user 2m37.240s
sys 0m0.080s
Summary:
C: 100%
Python: 692% (118% with PyPy)
Erlang: 436% (135% thanks to RichardC)
Haskell: 1421%
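I assume C has a big advantage because it uses long for the calculations rather than arbitrary-length integers as the other three do. It also doesn't need to load a runtime first (do the others?).
Question 1:
Do Erlang, Python and Haskell lose speed due to using arbitrary-length integers, or don't they as long as the values are less than MAXINT?
Question 2:
Why is Haskell so slow? Is there a compiler flag that turns off the brakes, or is it my implementation? (The latter is quite probable, as Haskell is a book with seven seals to me.)
Question 3:
Can you offer me some hints on how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.
EDIT:
Question 4:
Do my functional implementations permit LCO (last call optimization, a.k.a. tail recursion elimination) and hence avoid adding unnecessary frames onto the call stack?
I really tried to implement the same algorithm as similarly as possible in the four languages, although I have to admit that my Haskell and Erlang knowledge is very limited.
Source codes used: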
#include <stdio.h>
#include <math.h>
int factorCount (long n)
{
double square = sqrt (n);
int isquare = (int) square;
int count = isquare == square ? -1 : 0;
long candidate;
for (candidate = 1; candidate <= isquare; candidate ++)
if (0 == n % candidate) count += 2;
return count;
}
int main ()
{
long triangle = 1;
int index = 1;
while (factorCount (triangle) < 1001)
{
index ++;
triangle += index;
}
printf ("%ld\n", triangle);
}
#! /usr/bin/env python3.2

import math

def factorCount (n):
    square = math.sqrt (n)
    isquare = int (square)
    count = -1 if isquare == square else 0
    for candidate in range (1, isquare + 1):
        if not n % candidate: count += 2
    return count

triangle = 1
index = 1
while factorCount (triangle) < 1001:
    index += 1
    triangle += index

print (triangle)
-module (euler12).
-compile (export_all).
factorCount (Number) -> factorCount (Number, math:sqrt (Number), 1, 0).
factorCount (_, Sqrt, Candidate, Count) when Candidate > Sqrt -> Count;
factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;
factorCount (Number, Sqrt, Candidate, Count) ->
case Number rem Candidate of
0 -> factorCount (Number, Sqrt, Candidate + 1, Count + 2);
_ -> factorCount (Number, Sqrt, Candidate + 1, Count)
end.
nextTriangle (Index, Triangle) ->
Count = factorCount (Triangle),
if
Count > 1000 -> Triangle;
true -> nextTriangle (Index + 1, Triangle + Index + 1)
end.
solve () ->
io:format ("~p~n", [nextTriangle (1, 1) ] ),
halt (0).
factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' number sqrt candidate count
    | fromIntegral candidate > sqrt = count
    | number `mod` candidate == 0 = factorCount' number sqrt (candidate + 1) (count + 2)
    | otherwise = factorCount' number sqrt (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1
Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary-length integers, or don't they as long as the values are less than MAXINT?
This is unlikely. I cannot say much about Erlang and Haskell (well, maybe a bit about Haskell below), but I can point out a lot of other bottlenecks in Python. Every time the program executes an operation on some values in Python, it has to check that the values are of the proper type, and that costs a bit of time. Your factorCount function allocates a list with range (1, isquare + 1) each time it is called, and runtime, malloc-style memory allocation is way slower than iterating over a range with a counter as you do in C. Notably, factorCount() is called many times and so allocates a lot of lists. Also, let us not forget that Python is interpreted and the CPython interpreter does not focus heavily on optimization.
EDIT: Oh, well, I note that you are using Python 3, so range() does not return a list but a lazy range object. In this case, my point about allocating lists is half-wrong: the function just allocates range objects, which is inefficient nonetheless, but not as inefficient as allocating a list with a lot of items.
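As a rough illustration of the point about range objects, the sketch below compares the memory footprint of a Python 3 range with that of the equivalent list. The bound of one million is an arbitrary choice for this example, and exact byte counts vary between CPython builds.

```python
import sys

# Arbitrary bound chosen for illustration only.
LIMIT = 1_000_000

lazy = range(1, LIMIT + 1)          # constant-size object, values computed on demand
eager = list(range(1, LIMIT + 1))   # one pointer per element, allocated up front

def footprint():
    """Return (size of the range object, size of the list object) in bytes."""
    return sys.getsizeof(lazy), sys.getsizeof(eager)

if __name__ == "__main__":
    range_bytes, list_bytes = footprint()
    print("range:", range_bytes, "bytes; list:", list_bytes, "bytes")
```

Note that sys.getsizeof reports only the container itself, so the list figure understates the real cost once the integer objects are included.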
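Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes, or is it my implementation? (The latter is quite probable, as Haskell is a book with seven seals to me.)
Are you using Hugs? Hugs is a considerably slow interpreter. If you are using it, maybe you can get a better time with GHC, but I am only speculating here; what a good Haskell compiler does under the hood is fascinating and way beyond my comprehension :)
Question 3: Can you offer me some hints on how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.
I'd say you are playing an unfunny game. The best part of knowing various languages is using them in the most different ways possible :) But I digress; I just do not have any recommendation on this point. Sorry, I hope someone can help you here :)
Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?
As far as I remember, you just need to make sure that your recursive call is the last command before returning a value. In other words, a function like the one below could use such an optimization: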
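def factorial(n, acc=1):
    if n > 1:
        acc = acc * n
        n = n - 1
        return factorial(n, acc)
    else:
        return acc

However, you would not get such an optimization if your function were like the one below, because there is an operation (the multiplication) after the recursive call:

def factorial2(n):
    if n > 1:
        f = factorial2(n-1)
        return f*n
    else:
        return 1

I separated the operations into some local variables to make it clear which operations are executed. However, it is most usual to see these functions written as below, but they are equivalent for the point I am making:

def factorial(n, acc=1):
    if n > 1:
        return factorial(n-1, acc*n)
    else:
        return acc

def factorial2(n):
    if n > 1:
        return n*factorial2(n-1)
    else:
        return 1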
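Note that it is up to the compiler/interpreter to decide whether it will perform tail call elimination. For example, the Python interpreter does not do it, if I remember correctly (I used Python in my example only because of its fluent syntax). Anyway, if you find strange stuff such as factorial functions with two parameters (one of them named acc, accumulator, etc.), now you know why people do it :)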
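To make the last remark concrete, here is a small sketch showing that CPython keeps one stack frame per call even for a call in tail position, and how the accumulator version can be rewritten by hand as a loop. The names factorial_acc, factorial_loop and overflows_stack are made up for this illustration.

```python
import sys

def factorial_acc(n, acc=1):
    # The recursive call is in tail position, but CPython still pushes a frame.
    if n > 1:
        return factorial_acc(n - 1, acc * n)
    return acc

def factorial_loop(n):
    # Hand-written equivalent of tail call elimination: the parameters
    # become loop variables and the call becomes the next iteration.
    acc = 1
    while n > 1:
        acc, n = acc * n, n - 1
    return acc

def overflows_stack(n):
    """Return True if factorial_acc(n) exceeds the interpreter's recursion limit."""
    try:
        factorial_acc(n)
    except RecursionError:
        return True
    return False
```

Calling overflows_stack(sys.getrecursionlimit() + 100) reports the failure of the recursive form, while factorial_loop handles the same argument without growing the stack.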
Using GHC 7.0.3, gcc 4.4.6, Linux 2.6.29 on an x86_64 Core2 Duo (2.5GHz) machine, compiling with ghc -O2 -fllvm -fforce-recomp for Haskell and gcc -O3 -lm for C.
Your C routine runs in 8.4 seconds (faster than your run probably because of -O3)
The Haskell solution runs in 36 seconds (due to the -O2 flag)
Your factorCount' code isn't explicitly typed and defaults to Integer (thanks to Daniel for correcting my misdiagnosis here!). Giving it an explicit type signature (which is standard practice anyway) using Int changes the time to 11.1 seconds.
In factorCount' you have needlessly called fromIntegral. Fixing that results in no change though (the compiler is smart, lucky for you).
You used mod where rem is faster and sufficient. This changes the time to 8.5 seconds.
factorCount' is constantly applying two extra arguments that never change (number, sqrt). A worker/wrapper transformation gives us:
$ time ./so
842161320
real 0m7.954s
user 0m7.944s
sys 0m0.004s
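That's right, 7.95 seconds. Consistently half a second faster than the C solution. Without the -fllvm flag I'm still getting 8.182 seconds, so the NCG backend is doing well in this case too.
Conclusion: Haskell is awesome.
Resulting code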
factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' :: Int -> Int -> Int -> Int -> Int
factorCount' number sqrt candidate0 count0 = go candidate0 count0
  where
    go candidate count
      | candidate > sqrt = count
      | number `rem` candidate == 0 = go (candidate + 1) (count + 2)
      | otherwise = go (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1
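EDIT: So now that we've explored that, let's address the questions
Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary-length integers, or don't they as long as the values are less than MAXINT?
In Haskell, using Integer is slower than Int, but how much slower depends on the computations performed. Luckily (for 64-bit machines) Int is sufficient. For portability's sake you should probably rewrite my code to use Int64 or Word64 (C isn't the only language with a long).
Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes, or is it my implementation? (The latter is quite probable, as Haskell is a book with seven seals to me.)
Question 3: Can you offer me some hints on how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.
That was what I answered above. The answer was
0) Use optimization via -O2
1) Use fast (notably: unboxable) types when possible
2) rem not mod (a frequently forgotten optimization) and
3) worker/wrapper transformation (perhaps the most common optimization).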
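For readers unfamiliar with point 2, the difference between the two operators can be sketched in Python, whose % behaves like Haskell's mod (the result takes the sign of the divisor). The helper trunc_rem is a hypothetical name written for this example to mimic rem (the result takes the sign of the dividend). For the non-negative numbers used in this problem both give the same answer, so the extra sign adjustment that mod performs is not needed.

```python
def trunc_rem(a, b):
    """Remainder of truncating division, like Haskell's rem or C's %."""
    q = abs(a) // abs(b)          # magnitude of the truncated quotient
    if (a < 0) != (b < 0):
        q = -q                    # quotient is negative when the signs differ
    return a - b * q

def floor_mod(a, b):
    """Remainder of flooring division, like Haskell's mod."""
    return a % b

if __name__ == "__main__":
    print(trunc_rem(-7, 2), floor_mod(-7, 2))
```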
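Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?
Yes, that wasn't the issue. Good work, and glad you considered this.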
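There were some problems with your Erlang implementation. As a baseline for the following, my measured execution time for your unmodified Erlang program was 47.6 seconds, compared to 12.7 seconds for the C code.
(Edit: As of Erlang/OTP version 24, 2021, Erlang has an automatic JIT compiler, and the old +native compiler option is no longer supported or needed. I leave the paragraph below unchanged as a historical document. The remarks about export_all are still somewhat valid for the JIT's ability to generate good code.)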
The first thing you should do if you want to run computationally intensive Erlang code is to use native code. Compiling with erlc +native euler12 got the time down to 41.3 seconds. This is however a much lower speedup (just 15%) than expected from native compilation on this kind of code, and the problem is your use of -compile(export_all). This is useful for experimentation, but the fact that all functions are potentially reachable from the outside causes the native compiler to be very conservative. (The normal BEAM emulator is not that much affected.) Replacing this declaration with -export([solve/0]). gives a much better speedup: 31.5 seconds (almost 35% from the baseline).
But the code itself has a problem: for each iteration in the factorCount loop, you perform this test:
factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;
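The C code doesn't do this. In general, it is tricky to make a fair comparison between different implementations of the same code, in particular if the algorithm is numerical, because you need to be sure that they are actually doing the same thing. A slight rounding error in one implementation due to some typecast somewhere may cause it to do many more iterations than the other, even though both eventually reach the same result.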
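One way to take floating-point rounding out of such a comparison is to compute the loop bound with integer arithmetic only. The sketch below is a Python variant of the same counting loop using math.isqrt (available from Python 3.8); it is an illustration, not the code that was benchmarked here.

```python
import math

def factor_count(n):
    """Count the divisors of n >= 1 by trial division up to the integer square root."""
    root = math.isqrt(n)                     # exact floor of the square root, no floats
    count = -1 if root * root == n else 0    # a perfect square's root is counted once
    for candidate in range(1, root + 1):
        if n % candidate == 0:
            count += 2                       # candidate and n // candidate
    return count

if __name__ == "__main__":
    print(factor_count(28))
```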
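To eliminate this possible error source (and get rid of the extra test in each iteration), I rewrote the factorCount function as follows, closely modelled on the C code: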
factorCount (N) ->
Sqrt = math:sqrt (N),
ISqrt = trunc(Sqrt),
if ISqrt == Sqrt -> factorCount (N, ISqrt, 1, -1);
true -> factorCount (N, ISqrt, 1, 0)
end.
factorCount (_N, ISqrt, Candidate, Count) when Candidate > ISqrt -> Count;
factorCount ( N, ISqrt, Candidate, Count) ->
case N rem Candidate of
0 -> factorCount (N, ISqrt, Candidate + 1, Count + 2);
_ -> factorCount (N, ISqrt, Candidate + 1, Count)
end.
This rewrite, no export_all, and native compilation, gave me the following run time:
$ erlc +native euler12.erl
$ time erl -noshell -s euler12 solve
842161320
real 0m19.468s
user 0m19.450s
sys 0m0.010s
which is not too bad compared to the C code:
$ time ./a.out
842161320
real 0m12.755s
user 0m12.730s
sys 0m0.020s
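Considering that Erlang is not at all geared towards writing numerical code, being only 50% slower than C on a program like this is pretty good.
Finally, regarding your questions:
Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary-length integers, or don't they as long as the values are less than MAXINT?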
Yes, somewhat. In Erlang, there is no way of saying "use 32/64-bit arithmetic with wrap-around", so unless the compiler can prove some bounds on your integers (and it usually can't), it must check all computations to see if they can fit in a single tagged word or if it has to turn them into heap-allocated bignums. Even if no bignums are ever used in practice at runtime, these checks will have to be performed. On the other hand, that means you know that the algorithm will never fail because of an unexpected integer wraparound if you suddenly give it larger inputs than before.
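The trade-off described above can be sketched in Python, whose integers likewise never wrap. The ctypes.c_int64 type is used here only to emulate what fixed-width 64-bit arithmetic with wrap-around would produce; it is an illustration, not a description of how Erlang represents integers.

```python
import ctypes

MAX_INT64 = 2**63 - 1

def add_wrapping(a, b):
    """Add two integers as a signed 64-bit machine would, wrapping on overflow."""
    return ctypes.c_int64(a + b).value   # c_int64 keeps only the low 64 bits

def add_unbounded(a, b):
    """Add two integers with arbitrary precision, as Erlang or Python do."""
    return a + b

if __name__ == "__main__":
    print(add_wrapping(MAX_INT64, 1), add_unbounded(MAX_INT64, 1))
```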
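Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?
Yes, your Erlang code is correct with respect to last call optimization.
Regarding Python optimization, in addition to using PyPy (for pretty impressive speed-ups with zero change to your code), you could use PyPy's translation toolchain to compile an RPython-compliant version, or Cython to build an extension module, both of which are faster than the C version in my testing, with the Cython module nearly twice as fast. For reference I include C and PyPy benchmark results as well: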
C (compiled with gcc -O3 -lm)
% time ./euler12-c
842161320
./euler12-c  11.95s user 0.00s system 99% cpu 11.959 total
PyPy 1.5
% time pypy euler12.py
842161320
pypy euler12.py  16.44s user 0.01s system 99% cpu 16.449 total
RPython (using latest PyPy revision, c2f583445aee)
% time ./euler12-rpython-c
842161320
./euler12-rpy-c  10.54s user 0.00s system 99% cpu 10.540 total
Cython 0.15
% time python euler12-cython.py
842161320
python euler12-cython.py  6.27s user 0.00s system 99% cpu 6.274 total
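The RPython version has a couple of key changes. To translate into a standalone program you need to define your target, which in this case is the main function. It's expected to accept sys.argv as its only argument, and is required to return an int. You can translate it by using translate.py, % translate.py euler12-rpython.py, which translates to C and compiles it for you.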
# euler12-rpython.py

import math, sys

def factorCount(n):
    square = math.sqrt(n)
    isquare = int(square)
    count = -1 if isquare == square else 0
    for candidate in xrange(1, isquare + 1):
        if not n % candidate: count += 2
    return count

def main(argv):
    triangle = 1
    index = 1
    while factorCount(triangle) < 1001:
        index += 1
        triangle += index
    print triangle
    return 0

if __name__ == '__main__':
    main(sys.argv)

def target(*args):
    return main, None
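The Cython version was rewritten as an extension module _euler12.pyx, which I import and call from a normal Python file. The _euler12.pyx is essentially the same as your version, with some additional static type declarations. The setup.py has the normal boilerplate to build the extension, using python setup.py build_ext --inplace.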
# _euler12.pyx
from libc.math cimport sqrt

cdef int factorCount(int n):
    cdef int candidate, isquare, count
    cdef double square
    square = sqrt(n)
    isquare = int(square)
    count = -1 if isquare == square else 0
    for candidate in range(1, isquare + 1):
        if not n % candidate: count += 2
    return count

cpdef main():
    cdef int triangle = 1, index = 1
    while factorCount(triangle) < 1001:
        index += 1
        triangle += index
    print triangle

# euler12-cython.py
import _euler12
_euler12.main()

# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("_euler12", ["_euler12.pyx"])]

setup(
    name = 'Euler12-Cython',
    cmdclass = {'build_ext': build_ext},
    ext_modules = ext_modules
)
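I honestly have very little experience with either RPython or Cython, and was pleasantly surprised at the results. If you are using CPython, writing your CPU-intensive bits of code in a Cython extension module seems like a really easy way to optimize your program.
Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary-length integers, or don't they as long as the values are less than MAXINT?
Question one can be answered in the negative for Erlang. The last question is answered by using Erlang appropriately, as in:
http://bredsaal.dk/learning-erlang-using-projecteuler-net
Since it's faster than your initial C example, I would guess there are numerous problems, as others have already covered in detail.
This Erlang module executes on a cheap netbook in about 5 seconds ... It uses the network threads model in Erlang and, as such, demonstrates how to take advantage of the event model. It could be spread over many nodes. And it's fast. Not my code.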
-module(p12dist).
-author("Jannich Brendle, jannich@bredsaal.dk, http://blog.bredsaal.dk").
-compile(export_all).
server() ->
server(1).
server(Number) ->
receive {getwork, Worker_PID} -> Worker_PID ! {work,Number,Number+100},
server(Number+101);
{result,T} -> io:format("The result is: ~w.~n", [T]);
_ -> server(Number)
end.
worker(Server_PID) ->
Server_PID ! {getwork, self()},
receive {work,Start,End} -> solve(Start,End,Server_PID)
end,
worker(Server_PID).
start() ->
Server_PID = spawn(p12dist, server, []),
spawn(p12dist, worker, [Server_PID]),
spawn(p12dist, worker, [Server_PID]),
spawn(p12dist, worker, [Server_PID]),
spawn(p12dist, worker, [Server_PID]).
solve(N,End,_) when N =:= End -> no_solution;
solve(N,End,Server_PID) ->
T=round(N*(N+1)/2),
case (divisor(T,round(math:sqrt(T))) > 500) of
true ->
Server_PID ! {result,T};
false ->
solve(N+1,End,Server_PID)
end.
divisors(N) ->
divisor(N,round(math:sqrt(N))).
divisor(_,0) -> 1;
divisor(N,I) ->
case (N rem I) =:= 0 of
true ->
2+divisor(N,I-1);
false ->
divisor(N,I-1)
end.
The test below took place on an Intel(R) Atom(TM) CPU N270 @ 1.60GHz
~$ time erl -noshell -s p12dist start
The result is: 76576500.
^C
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
(v)ersion (k)ill (D)b-tables (d)istribution
a
real 0m5.510s
user 0m5.836s
sys 0m0.152s
Just for fun. The following is a more "native" Haskell implementation:
import Control.Applicative
import Control.Monad
import Data.Either
import Math.NumberTheory.Powers.Squares
isInt :: RealFrac c => c -> Bool
isInt = (==) <$> id <*> fromInteger . round
intSqrt :: (Integral a) => a -> Int
--intSqrt = fromIntegral . floor . sqrt . fromIntegral
intSqrt = fromIntegral . integerSquareRoot'
factorize :: Int -> [Int]
factorize 1 = []
factorize n = first : factorize (quot n first)
    where first = (!! 0) $ [a | a <- [2..intSqrt n], rem n a == 0] ++ [n]
factorize2 :: Int -> [(Int,Int)]
factorize2 = foldl (\ls@((val,freq):xs) y -> if val == y then (val,freq+1):xs else (y,1):ls) [(0,0)] . factorize
numDivisors :: Int -> Int
numDivisors = foldl (\acc (_,y) -> acc * (y+1)) 1 <$> factorize2
nextTriangleNumber :: (Int,Int) -> (Int,Int)
nextTriangleNumber (n,acc) = (n+1,acc+n+1)
forward :: Int -> (Int, Int) -> Either (Int, Int) (Int, Int)
forward k val@(n,acc) = if numDivisors acc > k then Left val else Right (nextTriangleNumber val)
problem12 :: Int -> (Int, Int)
problem12 n = (!!0) . lefts . scanl (>>=) (forward n (1,1)) . repeat . forward $ n
main = do
    let (n,val) = problem12 1000
    print val
Using ghc -O3, this consistently runs in 0.55-0.58 seconds on my machine (1.73GHz Core i7).
A more efficient factorCount function for the C version:
int factorCount (int n)
{
int count = 1;
int candidate,tmpCount;
while (n % 2 == 0) {
count++;
n /= 2;
}
for (candidate = 3; candidate < n && candidate * candidate <= n; candidate += 2)
if (n % candidate == 0) {
tmpCount = 1;
do {
tmpCount++;
n /= candidate;
} while (n % candidate == 0);
count*=tmpCount;
}
if (n > 1)
count *= 2;
return count;
}
Changing longs to ints in main and using gcc -O3 -lm, this consistently runs in 0.31-0.35 seconds.
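The idea behind this faster factorCount is that a number with prime factorization p1^a1 * ... * pk^ak has (a1+1) * ... * (ak+1) divisors. A Python sketch of the same approach follows; the name divisor_count is chosen for this example, and the loop bound uses <= so that a remaining square of a prime is still handled inside the loop.

```python
def divisor_count(n):
    """Number of divisors of n (n >= 1), from the exponents of its prime factors."""
    count = 1
    exponent = 0
    while n % 2 == 0:                    # strip factors of 2 first
        exponent += 1
        n //= 2
    count *= exponent + 1
    candidate = 3
    while candidate * candidate <= n:    # only odd candidates up to sqrt(n)
        exponent = 0
        while n % candidate == 0:
            exponent += 1
            n //= candidate
        count *= exponent + 1
        candidate += 2
    if n > 1:                            # whatever is left is a single prime
        count *= 2
    return count

if __name__ == "__main__":
    print(divisor_count(28))
```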
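Both can be made to run even faster if you take advantage of the fact that the nth triangle number = n*(n+1)/2, and that n and (n+1) have completely disparate prime factorizations (they are coprime), so the number of factors of each half can be multiplied to find the number of factors of the whole. The following: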
int main ()
{
int triangle = 0,count1,count2 = 1;
do {
count1 = count2;
count2 = ++triangle % 2 == 0 ? factorCount(triangle+1) : factorCount((triangle+1)/2);
} while (count1*count2 < 1001);
printf ("%lld\n", ((long long)triangle)*(triangle+1)/2);
}
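will reduce the C code run time to 0.17-0.19 seconds, and it can handle much larger searches: greater than 10000 factors takes about 43 seconds on my machine. I leave a similar Haskell speedup to the interested reader.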
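The coprimality trick reads as follows in Python. This is a sketch with hypothetical names (count_divisors, first_triangle_with_more_than); it uses plain trial division for the two halves so that it stays self-contained.

```python
def count_divisors(n):
    """Count divisors of n >= 1 by trial division up to sqrt(n)."""
    count = 0
    d = 1
    while d * d <= n:
        if n % d == 0:
            count += 1 if d * d == n else 2
        d += 1
    return count

def first_triangle_with_more_than(limit):
    """Return the first triangle number with more than `limit` divisors."""
    n = 1
    while True:
        # T(n) = n*(n+1)/2; halve whichever of n, n+1 is even. The two
        # halves are coprime, so their divisor counts multiply.
        if n % 2 == 0:
            a, b = n // 2, n + 1
        else:
            a, b = n, (n + 1) // 2
        if count_divisors(a) * count_divisors(b) > limit:
            return a * b
        n += 1

if __name__ == "__main__":
    print(first_triangle_with_more_than(500))
```

Each call now factors numbers on the order of n rather than n squared, which is where the saving comes from.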
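Some more numbers and explanations for the C version. Apparently no one did it during all those years. Remember to upvote this answer so it can get on top for everyone to see and learn.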
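Step One: Benchmark of the author's programs
Laptop Specifications:
CPU i3 M380 (931 MHz - maximum battery saving mode)
4 GB memory
Win7 64 bits
Microsoft Visual Studio 2012 Ultimate
Cygwin with gcc 4.9.3
Python 2.7.10
Commands: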
compiling on VS x64 command prompt > `for /f %f in ('dir /b *.c') do cl /O2 /Ot /Ox %f -o %f_x64_vs2012.exe`
compiling on cygwin with gcc x64 > `for f in ./*.c; do gcc -m64 -O3 $f -o ${f}_x64_gcc.exe ; done`
time (unix tools) using cygwin > `for f in ./*.exe; do echo "----------"; echo $f ; time $f ; done`
.
----------
$ time python ./original.py
real 2m17.748s
user 2m15.783s
sys 0m0.093s
----------
$ time ./original_x86_vs2012.exe
real 0m8.377s
user 0m0.015s
sys 0m0.000s
----------
$ time ./original_x64_vs2012.exe
real 0m8.408s
user 0m0.000s
sys 0m0.015s
----------
$ time ./original_x64_gcc.exe
real 0m20.951s
user 0m20.732s
sys 0m0.030s
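Filenames are: integertype_architecture_compiler.exe
integertype is the same as the original program for now (more on that later)
architecture is x86 or x64 depending on the compiler settings
compiler is gcc or vs2012
Step Two: Investigate, Improve and Benchmark Again
VS is 250% faster than gcc. The two compilers should give a similar speed. Obviously, something is wrong with the code or the compiler options. Let's investigate!
The first point of interest is the integer types. Conversions can be expensive, and consistency is important for better code generation and optimizations. All integers should be the same type.
It's a mixed mess of int and long right now. We're going to improve that. What type to use? The fastest. Gotta benchmark them all!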
----------
$ time ./int_x86_vs2012.exe
real 0m8.440s
user 0m0.016s
sys 0m0.015s
----------
$ time ./int_x64_vs2012.exe
real 0m8.408s
user 0m0.016s
sys 0m0.015s
----------
$ time ./int32_x86_vs2012.exe
real 0m8.408s
user 0m0.000s
sys 0m0.015s
----------
$ time ./int32_x64_vs2012.exe
real 0m8.362s
user 0m0.000s
sys 0m0.015s
----------
$ time ./int64_x86_vs2012.exe
real 0m18.112s
user 0m0.000s
sys 0m0.015s
----------
$ time ./int64_x64_vs2012.exe
real 0m18.611s
user 0m0.000s
sys 0m0.015s
----------
$ time ./long_x86_vs2012.exe
real 0m8.393s
user 0m0.015s
sys 0m0.000s
----------
$ time ./long_x64_vs2012.exe
real 0m8.440s
user 0m0.000s
sys 0m0.015s
----------
$ time ./uint32_x86_vs2012.exe
real 0m8.362s
user 0m0.000s
sys 0m0.015s
----------
$ time ./uint32_x64_vs2012.exe
real 0m8.393s
user 0m0.015s
sys 0m0.015s
----------
$ time ./uint64_x86_vs2012.exe
real 0m15.428s
user 0m0.000s
sys 0m0.015s
----------
$ time ./uint64_x64_vs2012.exe
real 0m15.725s
user 0m0.015s
sys 0m0.015s
----------
$ time ./int_x64_gcc.exe
real 0m8.531s
user 0m8.329s
sys 0m0.015s
----------
$ time ./int32_x64_gcc.exe
real 0m8.471s
user 0m8.345s
sys 0m0.000s
----------
$ time ./int64_x64_gcc.exe
real 0m20.264s
user 0m20.186s
sys 0m0.015s
----------
$ time ./long_x64_gcc.exe
real 0m20.935s
user 0m20.809s
sys 0m0.015s
----------
$ time ./uint32_x64_gcc.exe
real 0m8.393s
user 0m8.346s
sys 0m0.015s
----------
$ time ./uint64_x64_gcc.exe
real 0m16.973s
user 0m16.879s
sys 0m0.030s
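Integer types are int long int32_t uint32_t int64_t and uint64_t from #include <stdint.h>
There are LOTS of integer types in C, plus signed/unsigned to play with, plus the choice to compile as x86 or x64 (not to be confused with the actual integer size). That is a lot of versions to compile and run ^^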
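Step Three: Understanding the Numbers
Definitive conclusions:
32-bit integers are ~200% faster than 64-bit equivalents
unsigned 64-bit integers are 25% faster than signed 64-bit (unfortunately, I have no explanation for that)
Trick question: "What are the sizes of int and long in C?"
The right answer is: the sizes of int and long in C are not well-defined!
From the C spec:
int is at least 16 bits
long is at least 32 bits and at least as wide as an int
From the gcc man page (-m32 and -m64 flags):
The 32-bit environment sets int, long and pointer to 32 bits and generates code that runs on any i386 system.
The 64-bit environment sets int to 32 bits and long and pointer to 64 bits and generates code for AMD's x86-64 architecture.
From MSDN documentation (Data Type Ranges) https://msdn.microsoft.com/en-us/library/s3f49ktz%28v=vs.110%29.aspx:
int, 4 bytes, also known as signed
long, 4 bytes, also known as long int and signed long int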
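To Conclude: Lessons Learned
32-bit integers are faster than 64-bit integers.
Standard integer types are not well defined in C or C++; they vary depending on compilers and architectures. When you need consistency and predictability, use the uint32_t integer family from #include <stdint.h>.
Speed issues solved. All other languages are back hundreds of percent behind; C and C++ win again! They always do. The next improvement will be multithreading using OpenMP :D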
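The platform dependence described above can be observed without a C compiler. This Python sketch uses the standard ctypes module to report the sizes of the platform's C types next to the fixed-width ones; the sizes of int and long depend on the machine and compiler ABI, so the function name and the selection of types are only an illustration.

```python
import ctypes

def c_integer_sizes():
    """Map C type names to their size in bytes on the current platform."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),       # varies between platforms
        "int32_t": ctypes.sizeof(ctypes.c_int32),   # always 4
        "int64_t": ctypes.sizeof(ctypes.c_int64),   # always 8
    }

if __name__ == "__main__":
    for name, size in c_integer_sizes().items():
        print(f"{name}: {size} bytes")
```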
Trying Go:
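package main

import "fmt"
import "math"

func main() {
    var n, m, c int
    for i := 1; ; i++ {
        // Assign n first: in a single tuple assignment the right-hand
        // side would still see the previous value of n.
        n = i * (i + 1) / 2
        m, c = int(math.Sqrt(float64(n))), 0
        for f := 1; f <= m; f++ {
            if n%f == 0 {
                c++
            }
        }
        c *= 2
        if m*m == n {
            c-- // the root of a perfect square was counted twice
        }
        if c > 1001 {
            fmt.Println(n)
            break
        }
    }
}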
I get:
original C version: 9.1690 100%
Go: 8.2520 111%
But using:
package main
import (
"math"
"fmt"
)
// Sieve of Eratosthenes
func PrimesBelow(limit int) []int {
switch {
case limit < 2:
return []int{}
case limit == 2:
return []int{2}
}
sievebound := (limit - 1) / 2
sieve := make([]bool, sievebound+1)
crosslimit := int(math.Sqrt(float64(limit))-1) / 2
for i := 1; i <= crosslimit; i++ {
if !sieve[i] {
for j := 2 * i * (i + 1); j <= sievebound; j += 2*i + 1 {
sieve[j] = true
}
}
}
plimit := int(1.3*float64(limit)) / int(math.Log(float64(limit)))
primes := make([]int, plimit)
p := 1
primes[0] = 2
for i := 1; i <= sievebound; i++ {
if !sieve[i] {
primes[p] = 2*i + 1
p++
if p >= plimit {
break
}
}
}
last := len(primes) - 1
for i := last; i > 0; i-- {
if primes[i] != 0 {
break
}
last = i
}
return primes[0:last]
}
func main() {
fmt.Println(p12())
}
// Requires PrimesBelow from utils.go
func p12() int {
n, dn, cnt := 3, 2, 0
primearray := PrimesBelow(1000000)
for cnt <= 1001 {
n++
n1 := n
if n1%2 == 0 {
n1 /= 2
}
dn1 := 1
for i := 0; i < len(primearray); i++ {
if primearray[i]*primearray[i] > n1 {
dn1 *= 2
break
}
exponent := 1
for n1%primearray[i] == 0 {
exponent++
n1 /= primearray[i]
}
if exponent > 1 {
dn1 *= exponent
}
if n1 == 1 {
break
}
}
cnt = dn * dn1
dn = dn1
}
return n * (n - 1) / 2
}
I get:
original C version: 9.1690 100%
thaumkid's C version: 0.1060 8650%
first Go version: 8.2520 111%
second Go version: 0.0230 39865%
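The second Go version gets its speed from a precomputed prime table. A compact Python sketch of the same idea follows, with names (primes_below, divisor_count_with_primes) invented for this example; it is a simplified illustration rather than a port of the Go code.

```python
def primes_below(limit):
    """Sieve of Eratosthenes: all primes strictly less than `limit`."""
    if limit < 3:
        return []
    is_prime = [True] * limit
    is_prime[0] = is_prime[1] = False
    p = 2
    while p * p < limit:
        if is_prime[p]:
            for multiple in range(p * p, limit, p):
                is_prime[multiple] = False
        p += 1
    return [i for i, flag in enumerate(is_prime) if flag]

def divisor_count_with_primes(n, primes):
    """Divisor count of n >= 1; `primes` must cover every prime up to sqrt(n)."""
    count = 1
    for p in primes:
        if p * p > n:
            break
        exponent = 0
        while n % p == 0:
            exponent += 1
            n //= p
        count *= exponent + 1
    if n > 1:            # one prime factor larger than every prime tried so far
        count *= 2
    return count

if __name__ == "__main__":
    print(divisor_count_with_primes(76576500, primes_below(10000)))
```

Dividing only by primes skips every composite candidate, which is why the table pays for itself over many calls.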
I also tried Python3.6 and pypy3.3-5.5-alpha:
original C version: 8.629 100%
thaumkid's C version: 0.109 7916%
Python3.6: 54.795 16%
pypy3.3-5.5-alpha: 13.291 65%
and then with the following code I got:
original C version: 8.629 100%
thaumkid's C version: 0.109 7916%
Python3.6: 1.489 580%
pypy3.3-5.5-alpha: 0.582 1483%
def D(N):
    if N == 1: return 1
    sqrtN = int(N ** 0.5)
    nf = 1
    for d in range(2, sqrtN + 1):
        if N % d == 0:
            nf = nf + 1
    return 2 * nf - (1 if sqrtN**2 == N else 0)

L = 1000
Dt, n = 0, 0

while Dt <= L:
    t = n * (n + 1) // 2
    # Integer division keeps the arguments of D as ints under Python 3.
    Dt = D(n//2)*D(n+1) if n%2 == 0 else D(n)*D((n+1)//2)
    n = n + 1

print (t)