我想将我的自定义函数(它使用if-else阶梯)应用到这六列(ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr。Amer, ERI_HI_PacIsl, ERI_White)在我的数据帧的每一行。

I've tried different methods from other questions but still can't seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic they can't be counted as anything else. Even if they have a "1" in another ethnicity column they still are counted as Hispanic not two or more races. Similarly, if the sum of all the ERI columns is greater than 1 they are counted as two or more races and can't be counted as a unique ethnicity(except for Hispanic).

这几乎就像对每一行进行for循环,如果每个记录满足一个条件,它们就被添加到一个列表中,并从原始列表中删除。

从下面的数据框架中,我需要根据SQL中的以下规范计算一个新列:

标准

IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”

备注:如果西班牙裔的ERI标志为真(1),则该员工被归类为“西班牙裔”

备注:如果多于1个非西班牙ERI Flag为真,返回" Two or more "

DATAFRAME

     lname          fname       rno_cd  eri_afr_amer    eri_asian   eri_hawaiian    eri_hispanic    eri_nat_amer    eri_white   rno_defined
0    MOST           JEFF        E       0               0           0               0               0               1           White
1    CRUISE         TOM         E       0               0           0               1               0               0           White
2    DEPP           JOHNNY              0               0           0               0               0               1           Unknown
3    DICAP          LEO                 0               0           0               0               0               1           Unknown
4    BRANDO         MARLON      E       0               0           0               0               0               0           White
5    HANKS          TOM         0                       0           0               0               0               1           Unknown
6    DENIRO         ROBERT      E       0               1           0               0               0               1           White
7    PACINO         AL          E       0               0           0               0               0               1           White
8    WILLIAMS       ROBIN       E       0               0           1               0               0               0           White
9    EASTWOOD       CLINT       E       0               0           0               0               0               1           White

当前回答

还有另一种(易于推广的)方法,其基础是pandas.DataFrame.idxmax。首先,易于概括的序言。

# Indeed, all your conditions boils down to the following
_gt_1_key = 'two_or_more'
_lt_1_key = 'other'

# The "dictionary-based" if-else statements
labels = {
    _gt_1_key     : 'Two Or More',
    'eri_hispanic': 'Hispanic',
    'eri_nat_amer': 'A/I AK Native',
    'eri_asian'   : 'Asian',
    'eri_afr_amer': 'Black/AA',
    'eri_hawaiian': 'Haw/Pac Isl.',
    'eri_white'   : 'White',  
    _lt_1_key     : 'Other',
}

# The output-driving 1-0 matrix
mat = df.filter(regex='^eri_').copy()  # `~.copy` to avoid `SettingWithCopyWarning`

... 最后,以向量化的方式:

mat[_gt_1_key] = gt1 = mat.sum(axis=1)
mat[_lt_1_key] = gt1.eq(0).astype(int)
race_label     = mat.idxmax(axis=1).map(labels)

在哪里

>>> race_label
0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

那是一只熊猫。您可以轻松地在df中托管系列实例,即df['race_label'] = race_label。

其他回答

上面的答案是完全有效的,但是存在一个向量化的解决方案,形式为numpy.select。这允许你定义条件,然后定义这些条件的输出,比使用apply更有效:


首先,定义条件:

conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]

现在,定义相应的输出:

outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]

最后,使用numpy.select:

res = np.select(conditions, outputs, 'Other')
pd.Series(res)

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

为什么要numpy。选择使用而不是应用?下面是一些性能检查:

df = pd.concat([df]*1000)

In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %%timeit
    ...: conditions = [
    ...:     df['eri_hispanic'] == 1,
    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    ...:     df['eri_nat_amer'] == 1,
    ...:     df['eri_asian'] == 1,
    ...:     df['eri_afr_amer'] == 1,
    ...:     df['eri_hawaiian'] == 1,
    ...:     df['eri_white'] == 1,
    ...: ]
    ...:
    ...: outputs = [
    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ...: ]
    ...:
    ...: np.select(conditions, outputs, 'Other')
    ...:
    ...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

使用numpy。Select极大地提高了性能,并且随着数据的增长,差异只会增加。

如果我们检查它的源代码,apply()是Python for循环的语法糖(通过FrameApply类的apply_series_generator()方法)。因为它有pandas开销,所以通常比Python循环要慢。

尽可能使用优化的(向量化的)方法。如果必须使用循环,请使用@numba。jit装饰。

1. 不要使用apply()作为if-else阶梯

Df.apply()是在pandas中最慢的方法。如user3483203和Mohamed Thasin ah的回答所示,根据数据帧大小,np.select()和df. select()。要产生相同的输出,Loc可能比df.apply()快50-300倍。

碰巧,使用numba模块的@jit装饰器的循环实现(与apply()类似)比df快(大约50-60%)。Loc和np.select.1

Numba在numpy数组上工作,因此在使用jit装饰器之前,需要将数据帧转换为numpy数组。然后通过检查循环中的条件在预先初始化的空数组中填充值。由于numpy数组没有列名,所以必须通过循环中的索引访问列。与apply()中的if-else天梯相比,jit函数中的if-else天梯最不方便的部分是通过索引访问列。否则,它几乎是相同的实现。

import numpy as np
import numba as nb
@nb.jit(nopython=True)
def conditional_assignment(arr, res):    
    length = len(arr)
    for i in range(length):
        if arr[i][3] == 1:
            res[i] = 'Hispanic'
        elif arr[i][0] + arr[i][1] + arr[i][2] + arr[i][4] + arr[i][5] > 1:
            res[i] = 'Two Or More'
        elif arr[i][0]  == 1:
            res[i] = 'Black/AA'
        elif arr[i][1] == 1:
            res[i] = 'Asian'
        elif arr[i][2] == 1:
            res[i] = 'Haw/Pac Isl.'
        elif arr[i][4] == 1:
            res[i] = 'A/I AK Native'
        elif arr[i][5] == 1:
            res[i] = 'White'
        else:
            res[i] = 'Other'
    return res

# the columns with the boolean data
cols = [c for c in df.columns if c.startswith('eri_')]
# initialize an empty array to be filled in a loop
# for string dtype arrays, we need to know the length of the longest string
# and use it to set the dtype
res = np.empty(len(df), dtype=f"<U{len('A/I AK Native')}")
# pass the underlying numpy array of `df[cols]` into the jitted function
df['rno_defined'] = conditional_assignment(df[cols].values, res)

2. 不要使用apply()进行数值操作

如果需要通过添加两列来添加新行,您的第一反应可能是写入

df['c'] = df.apply(lambda row: row['a'] + row['b'], axis=1)

但与此相反,使用sum(axis=1)方法逐行添加(如果只有两列,则使用+运算符):

df['c'] = df[['a','b']].sum(axis=1)
# equivalently
df['c'] = df['a'] + df['b']

根据数据帧的大小,sum(1)可能比apply()快100倍。

事实上,你几乎不需要apply()在pandas数据帧上进行数值操作,因为它已经优化了大多数操作的方法:加法(sum(1))、减法(sub()或diff())、乘法(prod(1))、除法(div()或/)、幂(pow())、>、>=、==、%、//、&、|等都可以在整个数据帧上执行,而不需要apply()。

例如,假设你想用下面的规则创建一个新列:

IF [colC] > 0 THEN RETURN [colA] * [colB]
ELSE RETURN [colA] / [colB]

使用优化的pandas方法,可以写成

df['new'] = df[['colA','colB']].prod(1).where(df['colC']>0, df['colA'] / df['colB'])

等效的apply()解决方案是:

df['new'] = df.apply(lambda row: row.colA * row.colB if row.colC > 0 else row.colA / row.colB, axis=1)

对于具有20k行的数据帧,使用优化方法的方法比等效的apply()方法快250倍。这种差距只会随着数据大小的增加而增加(对于具有1 mil行的数据帧,它要快365倍),并且时间差将变得越来越明显



1

: In the below result, I show the performance of the three approaches using a dataframe with 24 mil rows (this is the largest frame I can construct on my machine). For smaller frames, the numba-jitted function consistently runs at least 50% faster than the other two as well (you can check yourself).
def pd_loc(df):
    df['rno_defined'] = 'Other'
    df.loc[df['eri_nat_amer'] == 1, 'rno_defined'] = 'A/I AK Native'
    df.loc[df['eri_asian'] == 1, 'rno_defined'] = 'Asian'
    df.loc[df['eri_afr_amer'] == 1, 'rno_defined'] = 'Black/AA'
    df.loc[df['eri_hawaiian'] == 1, 'rno_defined'] = 'Haw/Pac Isl.'
    df.loc[df['eri_white'] == 1, 'rno_defined'] = 'White'
    df.loc[df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1) > 1, 'rno_defined'] = 'Two Or More'
    df.loc[df['eri_hispanic'] == 1, 'rno_defined'] = 'Hispanic'
    return df

def np_select(df):
    conditions = [df['eri_hispanic'] == 1,
                  df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
                  df['eri_nat_amer'] == 1,
                  df['eri_asian'] == 1,
                  df['eri_afr_amer'] == 1,
                  df['eri_hawaiian'] == 1,
                  df['eri_white'] == 1]
    outputs = ['Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White']
    df['rno_defined'] = np.select(conditions, outputs, 'Other')
    return df


@nb.jit(nopython=True)
def conditional_assignment(arr, res):
    
    length = len(arr)
    for i in range(length):
        if arr[i][3] == 1 :
            res[i] = 'Hispanic'
        elif arr[i][0] + arr[i][1] + arr[i][2] + arr[i][4] + arr[i][5] > 1 :
            res[i] = 'Two Or More'
        elif arr[i][0]  == 1:
            res[i] = 'Black/AA'
        elif arr[i][1] == 1:
            res[i] = 'Asian'
        elif arr[i][2] == 1:
            res[i] = 'Haw/Pac Isl.'
        elif arr[i][4] == 1 :
            res[i] = 'A/I AK Native'
        elif arr[i][5] == 1:
            res[i] = 'White'
        else:
            res[i] = 'Other'
            
    return res

def nb_loop(df):
    cols = [c for c in df.columns if c.startswith('eri_')]
    res = np.empty(len(df), dtype=f"<U{len('A/I AK Native')}")
    df['rno_defined'] = conditional_assignment(df[cols].values, res)
    return df

# df with 24mil rows
n = 4_000_000
df = pd.DataFrame({
    'eri_afr_amer': [0, 0, 0, 0, 0, 0]*n, 
    'eri_asian': [1, 0, 0, 0, 0, 0]*n, 
    'eri_hawaiian': [0, 0, 0, 1, 0, 0]*n, 
    'eri_hispanic': [0, 1, 0, 0, 1, 0]*n, 
    'eri_nat_amer': [0, 0, 0, 0, 1, 0]*n, 
    'eri_white': [0, 0, 1, 1, 0, 0]*n
}, dtype='int8')
df.insert(0, 'name', ['MOST', 'CRUISE', 'DEPP', 'DICAP', 'BRANDO', 'HANKS']*n)

%timeit nb_loop(df)
# 5.23 s ± 45.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

%timeit pd_loc(df)
# 7.97 s ± 28.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

%timeit np_select(df)
# 8.5 s ± 39.6 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

2:在下面的结果中,我展示了两种方法的性能,分别使用一个具有20k行和1 mil行的数据框架。对于更小的帧,间隔更小,因为当apply()是一个循环时,优化的方法有一个开销。随着帧大小的增加,向量化开销w.r.t.减少到代码的总体运行时,而apply()仍然是帧上的循环。

n = 20_000 # 1_000_000
df = pd.DataFrame(np.random.rand(n,3)-0.5, columns=['colA','colB','colC'])

%timeit df[['colA','colB']].prod(1).where(df['colC']>0, df['colA'] / df['colB'])
# n = 20000: 2.69 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# n = 1000000: 86.2 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.apply(lambda row: row.colA * row.colB if row.colC > 0 else row.colA / row.colB, axis=1)
# n = 20000: 679 ms ± 33.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# n = 1000000: 31.5 s ± 587 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

因为这是'pandas new column from others'的第一个谷歌结果,这里有一个简单的例子:

import pandas as pd

# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
#    a  b
# 0  1  3
# 1  2  4

# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0    4
# 1    6

# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
#    a  b  c
# 0  1  3  4
# 1  2  4  6

如果你得到SettingWithCopyWarning,你也可以这样做:

fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'

来源:https://stackoverflow.com/a/12555510/243392

如果你的列名包含空格,你可以使用这样的语法:

df = df.assign(**{'some column name': col.values})

这是apply和assign的文档。

还有另一种(易于推广的)方法,其基础是pandas.DataFrame.idxmax。首先,易于概括的序言。

# Indeed, all your conditions boils down to the following
_gt_1_key = 'two_or_more'
_lt_1_key = 'other'

# The "dictionary-based" if-else statements
labels = {
    _gt_1_key     : 'Two Or More',
    'eri_hispanic': 'Hispanic',
    'eri_nat_amer': 'A/I AK Native',
    'eri_asian'   : 'Asian',
    'eri_afr_amer': 'Black/AA',
    'eri_hawaiian': 'Haw/Pac Isl.',
    'eri_white'   : 'White',  
    _lt_1_key     : 'Other',
}

# The output-driving 1-0 matrix
mat = df.filter(regex='^eri_').copy()  # `~.copy` to avoid `SettingWithCopyWarning`

... 最后,以向量化的方式:

mat[_gt_1_key] = gt1 = mat.sum(axis=1)
mat[_lt_1_key] = gt1.eq(0).astype(int)
race_label     = mat.idxmax(axis=1).map(labels)

在哪里

>>> race_label
0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

那是一只熊猫。您可以轻松地在df中托管系列实例,即df['race_label'] = race_label。

试试这个,

df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)

O/P:

     lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian  \
0      MOST    JEFF      E             0          0             0   
1    CRUISE     TOM      E             0          0             0   
2      DEPP  JOHNNY    NaN             0          0             0   
3     DICAP     LEO    NaN             0          0             0   
4    BRANDO  MARLON      E             0          0             0   
5     HANKS     TOM    NaN             0          0             0   
6    DENIRO  ROBERT      E             0          1             0   
7    PACINO      AL      E             0          0             0   
8  WILLIAMS   ROBIN      E             0          0             1   
9  EASTWOOD   CLINT      E             0          0             0   

   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label  
0             0             0          1       White         White  
1             1             0          0       White      Hispanic  
2             0             0          1     Unknown         White  
3             0             0          1     Unknown         White  
4             0             0          0       White         Other  
5             0             0          1     Unknown         White  
6             0             0          1       White   Two Or More  
7             0             0          1       White         White  
8             0             0          0       White  Haw/Pac Isl.  
9             0             0          1       White         White 

使用.loc代替apply。

它改进了向量化。

.loc的工作方式很简单,根据条件屏蔽行,对冻结行应用值。

欲了解更多细节,请访问。loc文档

性能指标:

答:接受

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)

%timeit df.apply(lambda row: label_race(row), axis=1)

每循环1.15 s±46.5 ms(平均±标准值7次运行,每循环1次)

我建议的答案:

def label_race(df):
    df.loc[df['eri_white']==1,'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
    df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)

%timeit label_race(df)

每循环24.7 ms±1.7 ms(平均±标准值7次运行,每循环10次)