因为MATLAB最初是为数值线性代数(矩阵操作)开发的编程语言,它有专门为矩阵乘法开发的库。现在MATLAB也可以使用图形处理器(图形处理单元)来实现这一点。
如果我们看一下计算结果:
2048x2048 4096x4096
--------- --------- ---------
CUDA C (ms) 43.11 391.05 3407.99
c++ (ms) 6137.10 64369.29 551390.93
c# (ms) 10509.00 300684.00 2527250.00
Java (ms) 9149.90 92562.28 838357.94
MATLAB (ms) 75.01 423.10 3133.90
然后我们可以看到不仅MATLAB在矩阵乘法方面如此之快:CUDA C(来自NVIDIA的编程语言)有一些比MATLAB更好的结果。CUDA C也有专门为矩阵乘法开发的库,它使用gpu。
MATLAB简史
Cleve Moler, the chairman of the computer science department at the University of New Mexico, started developing MATLAB in the late 1970s. He designed it to give his students access to LINPACK (a software library for performing numerical linear algebra) and EISPACK (is a software library for numerical computation of linear algebra) without them having to learn Fortran. It soon spread to other universities and found a strong audience within the applied mathematics community. Jack Little, an engineer, was exposed to it during a visit Moler made to Stanford University in 1983. Recognizing its commercial potential, he joined with Moler and Steve Bangert. They rewrote MATLAB in C and founded MathWorks in 1984 to continue its development. These rewritten libraries were known as JACKPAC. In 2000, MATLAB was rewritten to use a newer set of libraries for matrix manipulation, LAPACK (is a standard software library for numerical linear algebra).
Source
什么是CUDA C
CUDA C还使用专门为矩阵乘法开发的库,如OpenGL(开放图形库)。它还使用GPU和Direct3D(在MS Windows上)。
The CUDA platform is designed to work with programming languages such as C, C++, and Fortran. This accessibility makes it easier for specialists in parallel programming to use GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which required advanced skills in graphics programming. Also, CUDA supports programming frameworks such as OpenACC and OpenCL.
Example of CUDA processing flow:
Copy data from main memory to GPU memory
CPU initiates the GPU compute kernel
GPU's CUDA cores execute the kernel in parallel
Copy the resulting data from GPU memory to main memory
比较CPU和GPU的执行速度
We ran a benchmark in which we measured the amount of time it took to execute 50 time steps for grid sizes of 64, 128, 512, 1024, and 2048 on an Intel Xeon Processor X5650 and then using an NVIDIA Tesla C2050 GPU.
For a grid size of 2048, the algorithm shows a 7.5x decrease in compute time from more than a minute on the CPU to less than 10 seconds on the GPU. The log scale plot shows that the CPU is actually faster for small grid sizes. As the technology evolves and matures, however, GPU solutions are increasingly able to handle smaller problems, a trend that we expect to continue.
Source
来自CUDA C编程指南的介绍:
Driven by the insatiable market demand for realtime, high-definition 3D graphics, the programmable Graphic Processor Unit or GPU has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational horsepower and very high memory bandwidth, as illustrated by Figure 1 and Figure 2.
Figure 1. Floating-Point Operations per Second for the CPU and GPU
Figure 2. Memory Bandwidth for the CPU and GPU
The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation - exactly what graphics rendering is about - and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 3.
Figure 3. The GPU Devotes More Transistors to Data Processing
More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.
Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.
Source
先进的阅读
图形处理器(图形处理器)
MATLAB
C编程指南
在MATLAB中使用gpu
基本线性代数子程序(BLAS)
《解析高性能矩阵乘法》,Kazushige Goto和Robert A. Van De Geijn著
一些有趣的面孔
我写过c++矩阵乘法,它和Matlab一样快,但它需要一些小心。(在Matlab使用图形处理器之前)。
Сitation从这个答案。