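What are the advantages of NumPy over regular Python lists?

I have about 100 financial market series, and I am going to create a cube array of 100x100x100 = 1 million cells. I will regress each x on each y and z (a 3-variable regression) and fill the array with the standard errors.

I have heard that for "large matrices" I should use NumPy rather than Python lists, for performance and scalability reasons. The thing is, I know Python lists and they seem to work for me.

What will the benefits be if I move to NumPy?

What if I had 1000 series (that is, 1 billion floating-point cells in the cube)?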


Current answer

Here is a nice answer from the FAQ on the scipy.org website:

What advantages do NumPy arrays offer over (nested) Python lists?

Python’s lists are efficient general-purpose containers. They support (fairly) efficient insertion, deletion, appending, and concatenation, and Python’s list comprehensions make them easy to construct and manipulate. However, they have certain limitations: they don’t support “vectorized” operations like elementwise addition and multiplication, and the fact that they can contain objects of differing types means that Python must store type information for every element, and must execute type dispatching code when operating on each element. This also means that very few list operations can be carried out by efficient C loops – each iteration would require type checks and other Python API bookkeeping.
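
To make the "vectorized" point concrete, here is a minimal sketch (the variable names are only for illustration) showing what `+` does on lists compared with arrays:

```python
import numpy as np

a_list = [1.0, 2.0, 3.0]
b_list = [10.0, 20.0, 30.0]

# With lists, "+" concatenates; elementwise addition needs an explicit loop.
concatenated = a_list + b_list
list_sum = [a + b for a, b in zip(a_list, b_list)]

# With arrays, "+" and "*" operate elementwise inside compiled loops.
a_arr = np.array(a_list)
b_arr = np.array(b_list)
array_sum = a_arr + b_arr
array_product = a_arr * b_arr

print(concatenated)
print(list_sum)
print(array_sum)
print(array_product)
```

The list comprehension and the array expression compute the same values; the array version simply avoids the per-element Python bookkeeping described above.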

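Other answers

Alex mentioned memory efficiency and Roberto mentioned convenience, and these are both good points. For a few more ideas, I'll mention speed and functionality.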

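Functionality: you get a lot built in with NumPy: FFTs, convolutions, fast searching, basic statistics, linear algebra, histograms, etc. And really, who can live without FFTs?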
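
As a rough illustration of that built-in functionality, here is a small sketch on made-up data (the random signal and the 2x2 system are assumptions chosen only for the example):

```python
import numpy as np

rng = np.random.default_rng(0)        # seeded only so the sketch is repeatable
signal = rng.standard_normal(256)     # made-up data

spectrum = np.fft.rfft(signal)                                # FFT of a real signal
smoothed = np.convolve(signal, np.ones(5) / 5, mode="same")   # 5-point moving average
counts, edges = np.histogram(signal, bins=10)                 # histogram
mean, std = signal.mean(), signal.std()                       # basic statistics

# Fast searching: where would 0.0 be inserted in the sorted data?
position = np.searchsorted(np.sort(signal), 0.0)

# Linear algebra: solve the small system A @ solution = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = np.linalg.solve(A, b)
print(solution)
```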

Speed: here is a test that sums a list and a NumPy array, showing that the sum on the NumPy array is more than 10x faster (in this test; mileage may vary).

from numpy import arange
from timeit import Timer

Nelements = 10000
Ntimeits = 10000

x = arange(Nelements)
y = list(range(Nelements))

t_numpy = Timer("x.sum()", "from __main__ import x")
t_list = Timer("sum(y)", "from __main__ import y")
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits)/Ntimeits,))
print("list:  %.3e" % (t_list.timeit(Ntimeits)/Ntimeits,))

which on my system (while I was running a backup) gives:

numpy: 3.004e-05
list:  5.363e-04

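NumPy's arrays are more compact than Python lists: a list of lists as you describe would take at least 20 MB or so in Python, while a NumPy 3D array with single-precision floats in the cells would fit in 4 MB. Reading and writing items is also faster with NumPy.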

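Maybe you don't care that much for just a million cells, but you definitely would for a billion cells. Neither approach would fit in a 32-bit architecture, but with 64-bit builds NumPy would get away with 4 GB or so, while Python alone would need at least about 12 GB (lots of pointers, which double in size), a much costlier piece of hardware!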

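The difference is mostly due to "indirectness": a Python list is an array of pointers to Python objects, at least 4 bytes per pointer plus 16 bytes for even the smallest Python object (4 for the type pointer, 4 for the reference count, 4 for the value, and the memory allocator rounds up to 16). A NumPy array is an array of uniform values: single-precision numbers take 4 bytes each, double-precision ones 8 bytes. Less flexible, but you pay substantially for the flexibility of standard Python lists!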
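
The byte counts above describe a 32-bit build. If you want to check the figures on your own interpreter, a sketch along these lines should work. The list figure is only a rough estimate: it assumes every cell holds its own float object and ignores the list headers.

```python
import struct
import sys

import numpy as np

shape = (100, 100, 100)               # the cube from the question
n_cells = shape[0] * shape[1] * shape[2]

# NumPy: one contiguous buffer of fixed-size values.
cube32 = np.zeros(shape, dtype=np.float32)
cube64 = np.zeros(shape, dtype=np.float64)
print("float32 cube:", cube32.nbytes, "bytes")
print("float64 cube:", cube64.nbytes, "bytes")

# Nested lists: one pointer per cell, each pointing at a separate float object.
pointer_size = struct.calcsize("P")   # bytes per pointer on this build
float_size = sys.getsizeof(1.0)       # bytes per Python float object
list_estimate = n_cells * (pointer_size + float_size)  # ignores list headers
print("nested lists (rough estimate):", list_estimate, "bytes")

# Scaling to 1000 series: computed arithmetically, nothing is allocated here.
n_big = 1000 ** 3
big_float32_bytes = n_big * np.dtype(np.float32).itemsize
print("float32, 1000^3 cells:", big_float32_bytes, "bytes")
```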

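NumPy is not just more efficient; it is also more convenient. You get a lot of vector and matrix operations for free, which sometimes allow one to avoid unnecessary work. And they are also efficiently implemented.

For example, you could read your cube directly from a file into an array: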

x = numpy.fromfile("data", dtype=float).reshape((100, 100, 100))

Sum along the second dimension:

s = x.sum(axis=1)

Find which cells are above a threshold:

(x > 0.5).nonzero()

Keep only the even-indexed slices along the third dimension (dropping every odd-indexed one):

x[:, :, ::2]
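
The snippets above can be tried end to end with a sketch like the following. It first writes a random cube to a temporary file (the path and the random data are stand-ins for a real "data" file) so that there is something to read back:

```python
import os
import tempfile

import numpy as np

# Stand-in for the real data: a random cube written as raw float64 values.
rng = np.random.default_rng(0)
original = rng.random((100, 100, 100))
path = os.path.join(tempfile.mkdtemp(), "data")   # hypothetical file location
original.tofile(path)

# Read the raw values back and restore the cube shape.
x = np.fromfile(path, dtype=float).reshape((100, 100, 100))

s = x.sum(axis=1)                 # sum along the second dimension
i, j, k = (x > 0.5).nonzero()     # indices of the cells above the threshold
every_other = x[:, :, ::2]        # slices 0, 2, 4, ... of the third dimension

print(s.shape, len(i), every_other.shape)
```

Note that `x[:, :, ::2]` is a view on the same memory rather than a copy, so nothing is duplicated until you modify or copy it.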

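Also, many useful libraries work with NumPy arrays. For example, statistical analysis and visualization libraries.

Even if you don't have performance problems, learning NumPy is worth the effort.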

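All of these answers have highlighted almost all of the major differences between NumPy arrays and Python lists; I will just summarize them briefly here: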

1. NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically). Changing the size of an ndarray will create a new array and delete the original.
2. The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory (heterogeneous types are possible as well, but then you generally lose the efficient mathematical operations).
3. NumPy arrays facilitate advanced mathematical and other types of operations on large amounts of data. Typically such operations are executed more efficiently and with less code than is possible using Python's built-in sequences.
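
A short sketch of the fixed-size and same-type points, using small made-up arrays:

```python
import numpy as np

# Fixed size: "appending" returns a new array and leaves the original alone.
a = np.array([1, 2, 3])
b = np.append(a, 4)
print(a.shape, b.shape)

# Same data type: mixing ints and floats upcasts everything to float64.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# Heterogeneous contents are possible with dtype=object, but each element is
# then an ordinary Python object and operations run at Python speed.
objs = np.array([1, "two", 3.0], dtype=object)
print(objs.dtype)

# Elementwise arithmetic on a numeric array, with no explicit loop.
doubled = mixed * 2
print(doubled)
```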