I made some benchmarks in CUDA, C++, C#, and Java, and used MATLAB for verification and matrix generation. When I perform matrix multiplication with MATLAB, 2048x2048 and even bigger matrices are multiplied almost instantly.

             1024x1024   2048x2048   4096x4096
             ---------   ---------   ---------
CUDA C (ms)      43.11      391.05     3407.99
C++ (ms)       6137.10    64369.29   551390.93
C# (ms)       10509.00   300684.00  2527250.00
Java (ms)      9149.90    92562.28   838357.94
MATLAB (ms)      75.01      423.10     3133.90

Only CUDA is competitive, but I thought that at least C++ would be somewhat close, not 60 times slower. I also don't know what to think about the C# results. The algorithm is the same as for C++ and Java, but there's a giant jump from 1024 to 2048.

How is MATLAB performing matrix multiplication this fast?

The C++ code:

float temp = 0;
timer.start();
// naive triple loop: matice3 = matice1 * matice2 (all rozmer x rozmer)
for(int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        temp = 0; // accumulator for one output element
        for (int m = 0; m < rozmer; m++)
        {
            temp = temp + matice1[j][m] * matice2[m][k];
        }
        matice3[j][k] = temp;
    }
}
timer.stop();

Current answer

This kind of question recurs, and it deserves a clearer answer on Stack Overflow than "MATLAB uses highly optimized libraries" or "MATLAB uses the MKL".

History:

Matrix multiplication (together with matrix-vector multiplication, vector-vector multiplication, and many of the matrix decompositions) is the most important problem in linear algebra. Engineers have been solving these problems with computers since the earliest days.

I'm not an expert on the history, but apparently back then, everybody just wrote their own FORTRAN version with simple loops. Some standardization then came along, with the identification of "kernels" (basic routines) that most linear algebra problems needed in order to be solved. These basic operations were then standardized in a specification called Basic Linear Algebra Subprograms (BLAS). Engineers could then call these standard, well-tested BLAS routines in their code, making their work much easier.

BLAS:

BLAS evolved from level 1 (the first version, which defined scalar-vector and vector-vector operations) to level 2 (vector-matrix operations) and then to level 3 (matrix-matrix operations), providing more and more "kernels" and thereby standardizing more and more of the fundamental linear algebra operations. The original FORTRAN 77 implementations are still available on Netlib's website.
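
For a concrete sense of the three levels, here is a minimal C++ sketch calling one representative CBLAS routine per level (linking against any CBLAS provider such as the Netlib reference, OpenBLAS, or the MKL; the sizes and values are arbitrary illustration, not part of the original benchmark):

#include <cblas.h>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 4;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);

    // Level 1: vector-vector, O(n) work: dot = x . y
    double dot = cblas_ddot(n, x.data(), 1, y.data(), 1);

    // Level 2: matrix-vector, O(n^2) work: y = 1.0*A*x + 0.0*y
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                1.0, A.data(), n, x.data(), 1, 0.0, y.data(), 1);

    // Level 3: matrix-matrix, O(n^3) work: C = 1.0*A*B + 0.0*C
    // (dgemm, the routine this whole question revolves around)
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);

    std::printf("dot = %g, C[0][0] = %g\n", dot, C[0]);
    return 0;
}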

Towards better performance:

So over the years (notably between the BLAS level 1 and level 2 releases: the early 80s), hardware changed, with the advent of vector operations and cache hierarchies. These evolutions made it possible to increase the performance of the BLAS subroutines substantially. Different vendors then came along with ever more efficient implementations of the BLAS routines.

I don't know all the historical implementations (I was neither born nor a kid back then), but the two most famous came out in the early 2000s: the Intel MKL and GotoBLAS. Your MATLAB uses the Intel MKL, which is a very good, optimized BLAS, and that explains the great performance you see.

Technical details on matrix multiplication:

So why is Matlab (the MKL) so fast at dgemm (double-precision general matrix-matrix multiplication)? In simple terms: because it uses vectorization and good caching of the data. In more complex terms: see the article provided by Jonathan Moore.

Basically, when you perform your multiplication in the C++ code you provided, you are not at all cache-friendly. Since I suspect you created an array of pointers to row arrays, your accesses in your inner loop to the k-th column of "matice2": matice2[m][k] are very slow. Indeed, when you access matice2[0][k], you must get the k-th element of the array 0 of your matrix. Then in the next iteration, you must access matice2[1][k], which is the k-th element of another array (the array 1). Then in the next iteration you access yet another array, and so on... Since the entire matrix matice2 can't fit in the highest caches (it's 8*1024*1024 bytes large), the program must fetch the desired element from main memory, losing a lot of time.

If you just transpose the matrix so that accesses are to adjacent memory addresses, your code will already run much faster, because now entire rows can be loaded into the cache at once. Try this modified version:

timer.start();
float temp = 0;
// transpose matice2 into tempmat so the inner loop can read it row-wise
for (int p = 0; p < rozmer; p++)
{
    for (int q = 0; q < rozmer; q++)
    {
        tempmat[p][q] = matice2[q][p];
    }
}
// same triple loop as before, but both operands are now read contiguously
for(int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            temp = temp + matice1[j][m] * tempmat[k][m];
        }
        matice3[j][k] = temp;
    }
}
timer.stop();

So you can see how cache locality alone increased your code's performance quite substantially. Now real dgemm implementations exploit that to a very extensive level: they perform the multiplication on blocks of the matrix defined by the size of the TLB (Translation lookaside buffer; long story short: what can effectively be cached), so that they stream to the processor exactly the amount of data it can process. The other aspect is vectorization: they use the processor's vectorized instructions for optimal instruction throughput, something you can't really achieve from cross-platform C++ code.
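
As a hedged illustration of the blocking idea (not MKL's actual scheme), the transposed loop above can be tiled like this. BS = 64 is an arbitrary example size; the real choice depends on the cache/TLB geometry, and matice3 must be zeroed beforehand because partial sums are accumulated across tiles:

// cache-blocked version of the transposed loop; matice3 must start at zero
const int BS = 64; // illustrative tile size, not a tuned value
for (int jb = 0; jb < rozmer; jb += BS)
{
    for (int kb = 0; kb < rozmer; kb += BS)
    {
        for (int mb = 0; mb < rozmer; mb += BS)
        {
            // multiply one pair of BS x BS tiles that both fit in cache
            for (int j = jb; j < jb + BS && j < rozmer; j++)
            {
                for (int k = kb; k < kb + BS && k < rozmer; k++)
                {
                    float temp = 0;
                    for (int m = mb; m < mb + BS && m < rozmer; m++)
                    {
                        temp += matice1[j][m] * tempmat[k][m];
                    }
                    matice3[j][k] += temp; // accumulate across mb tiles
                }
            }
        }
    }
}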

Finally, to those claiming this is due to Strassen's or the Coppersmith-Winograd algorithm: both claims are wrong. Neither algorithm is implementable in practice, because of the hardware considerations mentioned above.

Other answers

Matlab incorporated LAPACK a while ago, so I assume their matrix multiplication uses something at least that fast. The LAPACK source code and documentation are readily available.

You might also look at Goto and Van De Geijn's paper "Anatomy of High-Performance Matrix Multiplication" at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.1785&rep=rep1&type=pdf

This is why. MATLAB doesn't perform a naive matrix multiplication by looping over every single element the way you did in your C++ code.

Of course I'm assuming that you just used C=A*B instead of writing a multiplication function yourself.

MATLAB uses a highly optimized implementation of LAPACK from Intel known as the Intel Math Kernel Library (Intel MKL) - specifically the dgemm function. This library takes full advantage of processor features, including SIMD instructions and multi-core processors. They don't document which specific algorithm they use. If you were to call the Intel MKL from C++, you should see similar performance.
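
A sketch of what that could look like, repeating the question's 2048x2048 single-precision case through the CBLAS interface that both the MKL and OpenBLAS export (the build command is one possibility, e.g. g++ bench.cpp -O2 -lopenblas, or link the MKL instead):

#include <cblas.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 2048;
    std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n, 0.0f);

    auto t0 = std::chrono::steady_clock::now();
    // single-precision gemm, the same operation as the benchmarked loops:
    // C = 1.0f*A*B + 0.0f*C
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0f, A.data(), n, B.data(), n, 0.0f, C.data(), n);
    auto t1 = std::chrono::steady_clock::now();

    long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("sgemm: %lld ms (C[0] = %g)\n", ms, C[0]);
    return 0;
}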

I'm not sure what library MATLAB uses for GPU multiplication, but presumably something like nVidia's CUBLAS.

Because MATLAB is a programming language originally developed for numerical linear algebra (matrix manipulations), it has libraries especially developed for matrix multiplication. And nowadays MATLAB can also use GPUs (Graphics Processing Units) for this.

If we look at the computation results:

             1024x1024   2048x2048   4096x4096
             ---------   ---------   ---------
CUDA C (ms)      43.11      391.05     3407.99
C++ (ms)       6137.10    64369.29   551390.93
C# (ms)       10509.00   300684.00  2527250.00
Java (ms)      9149.90    92562.28   838357.94
MATLAB (ms)      75.01      423.10     3133.90

then we can see that MATLAB is not the only one that is this fast at matrix multiplication: CUDA C (NVIDIA's programming language) has even better results than MATLAB. CUDA C also has libraries especially developed for matrix multiplication, and it uses GPUs.

Short history of MATLAB

Cleve Moler, the chairman of the computer science department at the University of New Mexico, started developing MATLAB in the late 1970s. He designed it to give his students access to LINPACK (a software library for performing numerical linear algebra) and EISPACK (a software library for numerical computation in linear algebra) without them having to learn Fortran. It soon spread to other universities and found a strong audience within the applied mathematics community. Jack Little, an engineer, was exposed to it during a visit Moler made to Stanford University in 1983. Recognizing its commercial potential, he joined with Moler and Steve Bangert. They rewrote MATLAB in C and founded MathWorks in 1984 to continue its development. These rewritten libraries were known as JACKPAC. In 2000, MATLAB was rewritten to use a newer set of libraries for matrix manipulation, LAPACK (a standard software library for numerical linear algebra). Source

What is CUDA C

CUDA C also uses libraries especially developed for matrix multiplication, such as OpenGL (Open Graphics Library). It also uses the GPU and Direct3D (on MS Windows).

The CUDA platform is designed to work with programming languages such as C, C++, and Fortran. This accessibility makes it easier for specialists in parallel programming to use GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which required advanced skills in graphics programming. Also, CUDA supports programming frameworks such as OpenACC and OpenCL.

Example of CUDA processing flow:

1. Copy data from main memory to GPU memory
2. CPU initiates the GPU compute kernel
3. GPU's CUDA cores execute the kernel in parallel
4. Copy the resulting data from GPU memory to main memory
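
A minimal C++ sketch of those four steps, using the CUDA runtime plus cuBLAS (the library speculated above to sit behind MATLAB's GPU multiply). Error handling is omitted, and the names and sizes are illustrative; build with something like nvcc flow.cpp -lcublas:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main()
{
    const int n = 2048;
    const size_t bytes = sizeof(float) * n * n;
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 1.0f), hC(n * n);

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);

    // 1. Copy data from main memory to GPU memory
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // 2. + 3. The CPU launches the compute kernel (inside cublasSgemm),
    // and the GPU's CUDA cores execute it in parallel.
    // Note: cuBLAS expects column-major data; that is harmless for this
    // square, uniform example but matters for real inputs.
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    // 4. Copy the result from GPU memory back to main memory
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}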

Comparing CPU and GPU execution speed

We ran a benchmark in which we measured the amount of time it took to execute 50 time steps for grid sizes of 64, 128, 512, 1024, and 2048 on an Intel Xeon Processor X5650 and then using an NVIDIA Tesla C2050 GPU. For a grid size of 2048, the algorithm shows a 7.5x decrease in compute time from more than a minute on the CPU to less than 10 seconds on the GPU. The log scale plot shows that the CPU is actually faster for small grid sizes. As the technology evolves and matures, however, GPU solutions are increasingly able to handle smaller problems, a trend that we expect to continue. Source

An introduction from the CUDA C Programming Guide:

Driven by the insatiable market demand for realtime, high-definition 3D graphics, the programmable Graphic Processor Unit or GPU has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational horsepower and very high memory bandwidth, as illustrated by Figure 1 and Figure 2.

Figure 1. Floating-Point Operations per Second for the CPU and GPU
Figure 2. Memory Bandwidth for the CPU and GPU

The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation - exactly what graphics rendering is about - and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 3.

Figure 3. The GPU Devotes More Transistors to Data Processing

More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.

Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology. Source

Further reading

- Graphics processing unit (GPU)
- MATLAB
- CUDA C Programming Guide
- Using GPUs in MATLAB
- Basic Linear Algebra Subprograms (BLAS)
- Anatomy of High-Performance Matrix Multiplication, by Kazushige Goto and Robert A. Van De Geijn


Some interesting facts

I have written C++ matrix multiplication that is as fast as MATLAB's, but it took some care. (This was before MATLAB was using GPUs for it.) Citation from this answer.