What is the most efficient way to loop through dataframes with pandas?

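I believe the simplest efficient way to loop through a DataFrame is to use numpy and numba. With that combination, a loop can in many cases run approximately as fast as a vectorized operation. If numba is not an option, plain numpy is likely the next best choice. As has been noted many times, your default should be vectorization; this answer only considers efficient looping, given that you have decided to loop for whatever reason.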

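For a test case, let's use the example from @DSM's answer of calculating a percentage change. This is a very simple situation, and as a practical matter you would not write a loop to calculate it, but as such it provides a reasonable baseline for timing vectorized approaches against loops.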

Let's set up the four approaches on a small DataFrame; below, we time them on a larger dataset.

import pandas as pd
import numpy as np
import numba as nb

df = pd.DataFrame( { 'close':[100,105,95,105] } )

pandas_vectorized = df.close.pct_change()[1:]

x = df.close.to_numpy()
numpy_vectorized = ( x[1:] - x[:-1] ) / x[:-1]
        
def test_numpy(x):
    pct_chng = np.zeros(len(x))
    for i in range(1,len(x)):
        pct_chng[i] = ( x[i] - x[i-1] ) / x[i-1]
    return pct_chng

numpy_loop = test_numpy(df.close.to_numpy())[1:]

@nb.jit(nopython=True)
def test_numba(x):
    pct_chng = np.zeros(len(x))
    for i in range(1,len(x)):
        pct_chng[i] = ( x[i] - x[i-1] ) / x[i-1]
    return pct_chng
    
numba_loop = test_numba(df.close.to_numpy())[1:]

Here are the timings on a DataFrame with 100,000 rows (timings performed with Jupyter's %timeit function and collapsed into a summary table for readability):

pandas/vectorized   1,130 micro-seconds
numpy/vectorized      382 micro-seconds
numpy/looped       72,800 micro-seconds
numba/looped          455 micro-seconds

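In summary: for simple cases like this one, go with (vectorized) pandas for simplicity and readability, and (vectorized) numpy for speed. If you really need to use a loop, do it in numpy. If numba is available, combine it with numpy for additional speed. In this case, numpy + numba is almost as fast as vectorized numpy code.

Other details: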

Not shown are various options like iterrows, itertuples, etc. which are orders of magnitude slower and really should never be used. The timings here are fairly typical: numpy is faster than pandas and vectorized is faster than loops, but adding numba to numpy will often speed numpy up dramatically. Everything except the pandas option requires converting the DataFrame column to a numpy array. That conversion is included in the timings. The time to define/compile the numpy/numba functions was not included in the timings, but would generally be a negligible component of the timing for any large dataframe.
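
To reproduce a comparison like the one above outside Jupyter, a minimal sketch using the standard-library timeit module might look like the following. The random 100,000-row data, the helper name time_call, and the repeat counts are assumptions for illustration, not the setup behind the table above. numba is left out so the snippet only needs pandas and numpy.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 100,000 random positive prices (not the original data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"close": rng.uniform(50.0, 150.0, size=100_000)})


def pandas_vectorized(frame):
    return frame["close"].pct_change().to_numpy()[1:]


def numpy_vectorized(frame):
    x = frame["close"].to_numpy()  # the conversion is part of the timed work
    return (x[1:] - x[:-1]) / x[:-1]


def numpy_loop(frame):
    x = frame["close"].to_numpy()
    pct_chng = np.zeros(len(x))
    for i in range(1, len(x)):
        pct_chng[i] = (x[i] - x[i - 1]) / x[i - 1]
    return pct_chng[1:]


def time_call(func, frame, number=3):
    """Hypothetical helper: best per-call time in microseconds."""
    best = min(timeit.repeat(lambda: func(frame), number=number, repeat=3))
    return best / number * 1e6


candidates = (pandas_vectorized, numpy_vectorized, numpy_loop)
results = {f.__name__: time_call(f, df) for f in candidates}
for name, micros in results.items():
    print(f"{name:<20}{micros:>12,.0f} micro-seconds")
```

Absolute numbers will differ by machine and library version; the relative ordering is the part worth checking.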

2020-12-22 19:06:17

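You can loop through the rows by transposing and then calling iteritems (named items in current pandas; iteritems was removed in pandas 2.0):

for date, row in df.T.items():
    # do some logic here
    pass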

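I'm not certain how efficient that is. To get the best possible performance from an iterative algorithm, you might want to explore writing it in Cython, so you could do something like: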

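# Cython (.pyx) source; the cimport supplies the array and element types
from numpy cimport ndarray, float64_t

def my_algo(ndarray[object] dates, ndarray[float64_t] open,
            ndarray[float64_t] low, ndarray[float64_t] high,
            ndarray[float64_t] close, ndarray[float64_t] volume):
    cdef:
        Py_ssize_t i, n
        float64_t foo
    n = len(dates)

    # with i typed as Py_ssize_t, this compiles to a plain C loop
    for i in range(n):
        foo = close[i] - open[i] # will be extremely fast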

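I would recommend writing the algorithm in pure Python first, making sure it works, and seeing how fast it is. If it's not fast enough, convert things to Cython like this with minimal work to get something about as fast as hand-coded C/C++.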
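
As a rough sketch of that workflow, a pure-Python prototype can be checked against a vectorized reference and timed before any Cython is written. The function name my_algo_py, the running-total logic, and the random data are hypothetical stand-ins; the Cython snippet above only computes close[i] - open[i] and leaves the rest open.

```python
import time

import numpy as np


def my_algo_py(open_, close):
    """Pure-Python prototype: accumulate close - open over all rows (hypothetical logic)."""
    total = 0.0
    for i in range(len(close)):
        total += close[i] - open_[i]
    return total


# Hypothetical input data, only for trying the prototype out.
rng = np.random.default_rng(0)
open_ = rng.uniform(90.0, 110.0, size=10_000)
close = open_ + rng.normal(0.0, 1.0, size=10_000)

start = time.perf_counter()
result = my_algo_py(open_, close)
elapsed = time.perf_counter() - start

# Confirm the prototype is correct before spending effort on speed.
reference = float((close - open_).sum())
print(f"result={result:.4f} reference={reference:.4f} time={elapsed * 1e3:.2f} ms")
```

Only if the measured time is too slow for the real workload does porting the loop to Cython become worthwhile.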

2011-10-21 13:04:53

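After noticing Nick Crawford's answer, I checked out iterrows, but found that it yields (index, Series) tuples. I'm not sure which would work best for you, but I ended up using the itertuples method for my problem, which yields (index, row_value1, ...) tuples.

There's also iterkv, which iterates through (column, series) tuples.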
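
To make the difference between the two concrete, here is a small sketch on a made-up two-row frame (the column names and values are assumptions chosen for illustration):

```python
import pandas as pd

# Hypothetical example frame.
df = pd.DataFrame({"open": [1.0, 2.0], "close": [1.5, 1.8]}, index=["a", "b"])

# iterrows yields (index, Series) pairs; values are looked up by column label.
for idx, row in df.iterrows():
    print(idx, row["open"], row["close"])

# itertuples yields namedtuples whose first field, Index, is the row index.
for tup in df.itertuples():
    print(tup.Index, tup.open, tup.close)

first_pair = next(df.iterrows())
first_tuple = next(df.itertuples())
```

Because iterrows builds a Series for every row, itertuples is usually the lighter of the two.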

2012-07-29 04:53:26

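Another suggestion would be to combine groupby with vectorized calculations if subsets of the rows share characteristics that allow you to do so.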
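
For instance, assuming a frame with a hypothetical ticker column that marks which rows belong together, the percentage-change calculation can be done per group without an explicit Python loop:

```python
import pandas as pd

# Hypothetical data: two tickers sharing one 'close' column.
df = pd.DataFrame({
    "ticker": ["A", "A", "A", "B", "B", "B"],
    "close": [100.0, 105.0, 95.0, 10.0, 11.0, 12.1],
})

# Percentage change computed within each ticker; the first row of each
# group has no predecessor and comes out as NaN.
df["pct_chng"] = df.groupby("ticker")["close"].pct_change()
print(df)
```

The grouping keeps the calculation from leaking across the boundary between ticker A and ticker B, which a plain df.close.pct_change() would not do.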

2014-11-14 12:30:51

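For sure, the fastest way to iterate over a dataframe is to access the underlying numpy ndarray, either via df.values (as you do) or by accessing each column separately with df.column_name.values. Since you also want access to the index, you can use df.index.values for that.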

index = df.index.values
column_of_interest1 = df.column_name1.values
...
column_of_interestk = df.column_namek.values

for i in range(df.shape[0]):
   index_value = index[i]
   ...
   column_value_k = column_of_interestk[i]

Not pythonic? Sure. But fast.
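
A filled-in version of that pattern, as a minimal sketch: the frame, the column names, and the "up day" logic are made up for illustration, and to_numpy() is used in place of .values (both return the underlying array here).

```python
import pandas as pd

# Hypothetical frame with a date index and two price columns.
df = pd.DataFrame(
    {"open": [10.0, 11.0, 12.0], "close": [10.5, 10.8, 12.4]},
    index=pd.date_range("2020-01-01", periods=3),
)

# Pull out plain numpy arrays once, outside the loop.
index = df.index.to_numpy()
open_ = df["open"].to_numpy()
close = df["close"].to_numpy()

# Loop over positions and index into the arrays directly.
up_days = []
for i in range(len(df)):
    if close[i] > open_[i]:
        up_days.append(index[i])

print(up_days)
```

The speed comes from indexing raw arrays instead of asking pandas to build a row object on every iteration.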

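If you want to squeeze more juice out of the loop, you will want to look into cython. Cython will let you gain huge speedups (think 10x-100x). For maximum performance, check out Cython's memory views.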

2018-03-23 01:51:44
