How to apply a function to two columns of a Pandas DataFrame

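Suppose I have a df with columns 'ID', 'col_1', 'col_2', and I define a function:

f = lambda x, y: my_function_expression

Now I want to apply f to df's two columns 'col_1' and 'col_2' to calculate a new column 'col_3' element-wise, somewhat like: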

df['col_3'] = df[['col_1','col_2']].apply(f)  
# Pandas gives : TypeError: ('<lambda>() takes exactly 2 arguments (1 given)'

What should I do?

**Added a detailed sample below**

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

  ID  col_1  col_2            col_3
0  1      0      1       ['a', 'b']
1  2      2      4  ['c', 'd', 'e']
2  3      3      5  ['d', 'e', 'f']

Current answer

Here is a faster solution:

def func_1(a,b):
    return a + b

df["C"] = func_1(df["A"].to_numpy(),df["B"].to_numpy())

This is 380 times faster than df.apply(f, axis=1) from @Aman and 310 times faster than df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1) from @ajrwhite.

I have added some benchmarks too:

Results:

  FUNCTIONS   TIMINGS   GAIN
apply lambda    0.7     x 1
apply           0.56    x 1.25
map             0.3     x 2.3
np.vectorize    0.01    x 70
f3 on Series    0.0026  x 270
f3 on np arrays 0.0018  x 380
f3 numba        0.0018  x 380

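In short:

Using apply is slow. We can speed things up very simply, just by using a function that operates directly on Pandas Series (or better, on numpy arrays). Because the function operates on whole Pandas Series or numpy arrays, the operations can be vectorized. The function returns a Pandas Series or numpy array that we assign as a new column.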
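As a minimal sketch of the point above (assuming the column names "A" and "B" used in the benchmark, a small made-up DataFrame, and a hypothetical function name `add`), the row-wise and vectorized calls produce the same column; only the second avoids a Python-level loop over rows:

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

def add(a, b):
    # Works on scalars, pandas Series, or numpy arrays alike.
    return a + b

# Row-wise: add() is called once per row with two scalars.
row_wise = df.apply(lambda row: add(row["A"], row["B"]), axis=1)

# Vectorized: add() is called once with two whole arrays.
df["C"] = add(df["A"].to_numpy(), df["B"].to_numpy())

print(df["C"].tolist())  # [11, 22, 33]
```

This only works when the function body is built from operations that numpy or pandas can apply to whole arrays, such as arithmetic.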

Here is the benchmark code:

import timeit

timeit_setup = """
import pandas as pd
import numpy as np
import numba

np.random.seed(0)

# Create a DataFrame of 10000 rows with 2 columns "A" and "B" 
# containing integers from 0 to 9 (randint's upper bound is exclusive)
df = pd.DataFrame(np.random.randint(0,10,size=(10000, 2)), columns=["A", "B"])

def f1(a,b):
    # Here a and b are the values of column A and B for a specific row: integers
    return a + b

def f2(x):
    # Here, x is pandas Series, and corresponds to a specific row of the DataFrame
    # 0 and 1 are the positions of columns A and B; .iloc makes the
    # positional lookup explicit (x[0] relies on a deprecated fallback)
    return x.iloc[0] + x.iloc[1]

def f3(a,b):
    # Same as f1 but we will pass parameters that will allow vectorization
    # Here, A and B will be Pandas Series or numpy arrays
    # with df["C"] = f3(df["A"],df["B"]): Pandas Series
    # with df["C"] = f3(df["A"].to_numpy(),df["B"].to_numpy()): numpy arrays
    return a + b

@numba.njit('int64[:](int64[:], int64[:])')
def f3_numba_vectorize(a,b):
    # Here a and b are 2 numpy arrays with dtype int64
    # This function must return a numpy array with dtype int64
    return a + b

"""

test_functions = [
'df["C"] = df.apply(lambda row: f1(row["A"], row["B"]), axis=1)',
'df["C"] = df.apply(f2, axis=1)',
'df["C"] = list(map(f3,df["A"],df["B"]))',
'df["C"] = np.vectorize(f3) (df["A"].to_numpy(),df["B"].to_numpy())',
'df["C"] = f3(df["A"],df["B"])',
'df["C"] = f3(df["A"].to_numpy(),df["B"].to_numpy())',
'df["C"] = f3_numba_vectorize(df["A"].to_numpy(),df["B"].to_numpy())'
]


for test_function in test_functions:
    print(min(timeit.repeat(setup=timeit_setup, stmt=test_function, repeat=7, number=10)))

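Output: (the timings are summarized in the results table above)

Final note: things could be optimized further with Cython and other numba tricks.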
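The `map` row in the benchmark is also a reasonable fallback when the function cannot be vectorized. A hedged sketch, reusing `get_sublist` and `mylist` from the question: list slicing has no array-wide equivalent, so the function still runs once per row, but without the per-row Series that `apply(axis=1)` builds.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta, end):
    # end is inclusive, as in the question
    return mylist[sta:end + 1]

# map() walks the two columns in parallel and calls get_sublist(sta, end) per pair.
df['col_3'] = list(map(get_sublist, df['col_1'], df['col_2']))

print(df['col_3'].tolist())
# [['a', 'b'], ['c', 'd', 'e'], ['d', 'e', 'f']]
```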

2022-01-28 19:33:44

Other answers

An interesting question! My answer is below:

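import pandas as pd

def sublst(row):
    # row is one DataFrame row (a Series); the slice end row['J2'] is exclusive
    return lst[row['J1']:row['J2']]

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print(df)
lst = ['a','b','c','d','e','f']

# axis=1 calls sublst once per row
df['J3'] = df.apply(sublst,axis=1)
print(df)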

Output:

  ID  J1  J2
0  1   0   1
1  2   2   4
2  3   3   5
  ID  J1  J2      J3
0  1   0   1     [a]
1  2   2   4  [c, d]
2  3   3   5  [d, e]

I changed the column names to ID, J1, J2, J3 to ensure ID < J1 < J2 < J3, so that the columns display in the right order.

One more brief version:

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print(df)
lst = ['a','b','c','d','e','f']

df['J3'] = df.apply(lambda row:lst[row['J1']:row['J2']],axis=1)
print(df)
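Note that the slices above stop before J2, whereas `get_sublist` in the question includes the end index. A small sketch of the same lambda with an inclusive end (the `+ 1` is the only change, added here as an assumption about the intended result):

```python
import pandas as pd

lst = ['a', 'b', 'c', 'd', 'e', 'f']
df = pd.DataFrame({'ID': ['1', '2', '3'], 'J1': [0, 2, 3], 'J2': [1, 4, 5]})

# + 1 makes the end index inclusive, matching get_sublist in the question.
df['J3'] = df.apply(lambda row: lst[row['J1']:row['J2'] + 1], axis=1)

print(df['J3'].tolist())
# [['a', 'b'], ['c', 'd', 'e'], ['d', 'e', 'f']]
```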

2015-04-24 02:33:45

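The method you are looking for is Series.combine. However, it seems some care has to be taken around datatypes. In your example, you would (as I did when testing the answer) naively call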

df['col_3'] = df.col_1.combine(df.col_2, func=get_sublist)

However, this throws the error:

ValueError: setting an array element with a sequence.

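My best guess is that it expects the result to be of the same type as the series calling the method (df.col_1 here). However, the following works: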

df['col_3'] = df.col_1.astype(object).combine(df.col_2, func=get_sublist)

df

   ID   col_1   col_2   col_3
0   1   0   1   [a, b]
1   2   2   4   [c, d, e]
2   3   3   5   [d, e, f]

2015-03-05 15:20:26

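The way you have written f, it needs two inputs. If you look at the error message, it says you are not providing two inputs to f, just one. The error message is correct. The mismatch is because df[['col1','col2']] returns a single dataframe with two columns, not two separate columns.

You need to change your f so that it takes a single input, keep the above dataframe as the input, then break it up into x, y inside the function body. Then do whatever you need and return a single value.

You need this function signature because the syntax is .apply(f), so f needs to take a single argument (one row or column of the dataframe, passed as a Series), not the two arguments your current f expects.

Since you haven't provided the body of f, I can't help in any more detail, but this should provide the way out without fundamentally changing your code or using some other method rather than apply.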
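A minimal sketch of that change, assuming the column names `col_1` and `col_2` from the question. The two-argument `f` is a hypothetical stand-in (plain addition), since the question does not give its body, and `f_row` is an illustrative wrapper name:

```python
import pandas as pd

df = pd.DataFrame({'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

# Hypothetical stand-in for the question's two-argument f.
def f(x, y):
    return x + y

# One-argument wrapper: receives a row (a Series) and unpacks it into x and y.
def f_row(row):
    x, y = row['col_1'], row['col_2']
    return f(x, y)

# axis=1 passes rows; the default axis=0 would pass whole columns instead.
df['col_3'] = df[['col_1', 'col_2']].apply(f_row, axis=1)

print(df['col_3'].tolist())  # [1, 6, 8]
```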

2013-05-30 00:53:50

There are two simple ways. Say we want the sum of col1 and col2 in an output column named col_sum.

Method 1

f = lambda x : x.col1 + x.col2
df['col_sum'] = df.apply(f, axis=1)

Method 2

def f(x):
    x['col_sum'] = x.col1 + x.col2
    return x
df = df.apply(f, axis=1)

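Method 2 should be used when some complex function has to be applied to the dataframe. It can also be used when output in more than one column is required.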
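As a hedged sketch of the multi-column case: one common pattern is to return a Series from the function, so that `apply(axis=1)` builds one output column per label. The `col_diff` column and the sample values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

def f(x):
    # x is one row; each label in the returned Series becomes an output column.
    return pd.Series({'col_sum': x.col1 + x.col2,
                      'col_diff': x.col2 - x.col1})

# apply returns a DataFrame with columns col_sum and col_diff;
# join attaches it to the original frame by index.
df = df.join(df.apply(f, axis=1))

print(df)
```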

2022-04-14 20:23:10
