I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe.

I've tried different methods from other questions but still can't seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic, they can't be counted as anything else. Even if they have a "1" in another ethnicity column, they are still counted as Hispanic, not two or more races. Similarly, if the sum of all the ERI columns is greater than 1, they are counted as two or more races and can't be counted as a unique ethnicity (except for Hispanic).

It's almost like doing a for loop through each row: if a record meets a criterion, it gets added to one list and removed from the original list.

From the dataframe below, I need to calculate a new column based on the following spec in SQL:

CRITERIA

IF [ERI_Hispanic] = 1 THEN RETURN "Hispanic"
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN "Two or More"
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN "A/I AK Native"
ELSE IF [ERI_Asian] = 1 THEN RETURN "Asian"
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN "Black/AA"
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN "Haw/Pac Isl."
ELSE IF [ERI_White] = 1 THEN RETURN "White"

Comment: If the ERI flag for Hispanic is True (1), the employee is classified as "Hispanic".

Comment: If more than one non-Hispanic ERI flag is True, return "Two or More".

DATAFRAME

     lname          fname       rno_cd  eri_afr_amer    eri_asian   eri_hawaiian    eri_hispanic    eri_nat_amer    eri_white   rno_defined
0    MOST           JEFF        E       0               0           0               0               0               1           White
1    CRUISE         TOM         E       0               0           0               1               0               0           White
2    DEPP           JOHNNY              0               0           0               0               0               1           Unknown
3    DICAP          LEO                 0               0           0               0               0               1           Unknown
4    BRANDO         MARLON      E       0               0           0               0               0               0           White
5    HANKS          TOM                 0               0           0               0               0               1           Unknown
6    DENIRO         ROBERT      E       0               1           0               0               0               1           White
7    PACINO         AL          E       0               0           0               0               0               1           White
8    WILLIAMS       ROBIN       E       0               0           1               0               0               0           White
9    EASTWOOD       CLINT       E       0               0           0               0               0               1           White

Current answer

Choose an approach based on how complex the criteria are

For the examples below - in order to show multiple kinds of rules for the new column - we will assume that the DataFrame contains columns 'red', 'green' and 'blue', holding floating-point values between 0 and 1.

The general case: .apply

As long as the logic needed to compute the new value can be written as a function of the other values in the same row, we can use the DataFrame's .apply method to get the desired result. Write the function so that it accepts a single parameter, which is a single row of the input:

def as_hex(value):
    # clamp to avoid rounding errors etc.
    return min(max(0, int(value * 256)), 255)

def hex_color(row):
    r, g, b = as_hex(row['red']), as_hex(row['green']), as_hex(row['blue'])
    return f'#{r:02x}{g:02x}{b:02x}'

Pass the function itself (do not write parentheses after the name) to .apply, and specify axis=1 (meaning that rows are supplied to the classification function, in order to compute a column - rather than the other way around). Thus:

df['hex_color'] = df.apply(hex_color, axis=1)

Note that wrapping the function in a lambda is unnecessary, since we are not binding any arguments or otherwise modifying the function.
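
For comparison, the redundant lambda form would look like this; it produces the same result, the wrapper just adds an extra function call per row:

df['hex_color'] = df.apply(lambda row: hex_color(row), axis=1)  # works, but the lambda adds nothing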

The .apply step is necessary because the conversion function itself is not vectorized. Thus, a naive approach like df['color'] = hex_color(df) will not work (example question).

This tool is powerful, but inefficient. For the best performance, use a more specific approach where one applies.

Multiple choices with conditions: numpy.select, or repeated assignment with df.loc or df.where

Suppose we were thresholding the color values and computing rough color names, like so:

def additive_color(row):
    # Insert here: logic that takes values from the `row` and computes
    # the desired cell value for the new column in that row.
    # The `row` is an ordinary `Series` object representing a row of the
    # original `DataFrame`; it can be indexed with column names, thus:
    if row['red'] > 0.5:
        if row['green'] > 0.5:
            return 'white' if row['blue'] > 0.5 else 'yellow'
        else:
            return 'magenta' if row['blue'] > 0.5 else 'red'
    elif row['green'] > 0.5:
        return 'cyan' if row['blue'] > 0.5 else 'green'
    else:
        return 'blue' if row['blue'] > 0.5 else 'black'
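
For what it's worth, on Python 3.10 and up the same ladder could also be sketched with match/case, matching on a tuple of threshold tests. The bool() coercion is deliberate: literal True/False patterns are matched by identity, and comparisons on pandas values may return numpy booleans rather than plain bools.

def additive_color_matched(row):
    # Same classification as additive_color, written as a match/case (Python 3.10+).
    match (bool(row['red'] > 0.5), bool(row['green'] > 0.5), bool(row['blue'] > 0.5)):
        case (True, True, True):
            return 'white'
        case (True, True, False):
            return 'yellow'
        case (True, False, True):
            return 'magenta'
        case (True, False, False):
            return 'red'
        case (False, True, True):
            return 'cyan'
        case (False, True, False):
            return 'green'
        case (False, False, True):
            return 'blue'
        case (False, False, False):
            return 'black'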

In a case like this - where the classification function would be an if/else ladder, or a match/case in 3.10 and up - we may get much faster performance using numpy.select.

This approach works quite differently. First, compute masks on the data for where each condition applies:

black = (df['red'] <= 0.5) & (df['green'] <= 0.5) & (df['blue'] <= 0.5)
white = (df['red'] > 0.5) & (df['green'] > 0.5) & (df['blue'] > 0.5)

To call numpy.select, we need two parallel sequences - one of the conditions, and another of the corresponding values:

df['color'] = np.select(
    [white, black],
    ['white', 'black'],
    'colorful'
)

The optional third argument specifies a value to use when none of the conditions are met. (As an exercise: fill in the remaining conditions, and try it without a third argument.)
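
One possible completion of that exercise, spelling out the six remaining masks in the same style; since the eight masks now cover every combination, the third argument can simply be dropped:

red = (df['red'] > 0.5) & (df['green'] <= 0.5) & (df['blue'] <= 0.5)
green = (df['red'] <= 0.5) & (df['green'] > 0.5) & (df['blue'] <= 0.5)
blue = (df['red'] <= 0.5) & (df['green'] <= 0.5) & (df['blue'] > 0.5)
yellow = (df['red'] > 0.5) & (df['green'] > 0.5) & (df['blue'] <= 0.5)
magenta = (df['red'] > 0.5) & (df['green'] <= 0.5) & (df['blue'] > 0.5)
cyan = (df['red'] <= 0.5) & (df['green'] > 0.5) & (df['blue'] > 0.5)

df['color'] = np.select(
    [white, black, red, green, blue, yellow, magenta, cyan],
    ['white', 'black', 'red', 'green', 'blue', 'yellow', 'magenta', 'cyan'],
)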

A similar approach is to make repeated assignments based on each condition. Assign the default value first, then use df.loc to assign the specific value for each condition:

df['color'] = 'colorful'
df.loc[white, 'color'] = 'white'
df.loc[black, 'color'] = 'black'

Alternatively, .where can be used to make the assignments. However, .where, used like this, assigns the specified value in the places where the condition is not met, so the conditions must be inverted:

df['color'] = 'colorful'
df['color'] = df['color'].where(~white, 'white').where(~black, 'black')

Simple mathematical manipulations: built-in math operators and broadcasting

For example, an apply-based approach like this:

def brightness(row):
    return row['red'] * .299 + row['green'] * .587 + row['blue'] * .114

df['brightness'] = df.apply(brightness, axis=1)

can instead be written by broadcasting the operators, for much better performance (and it is also simpler):

df['brightness'] = df['red'] * .299 + df['green'] * .587 + df['blue'] * .114

As an exercise, here is the first example redone that way:

def as_hex(column):
    scaled = (column * 256).astype(int)
    clamped = scaled.where(scaled >= 0, 0).where(scaled <= 255, 255)
    return clamped.apply(lambda i: f'{i:02x}')

df['hex_color'] = '#' + as_hex(df['red']) + as_hex(df['green']) + as_hex(df['blue'])

I was not able to find a vectorized equivalent to format the integer values as hex strings, so .apply is still used internally here - meaning that the full speed penalty still comes into play. Still, this demonstrates some general techniques.

For more details and examples, see cottontail's answer.

Other answers

The answer above is perfectly valid, but a vectorized solution exists in the form of numpy.select. It allows you to define the conditions, then define the outputs for those conditions, much more efficiently than using apply:


First, define the conditions:

conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]

Now, define the corresponding outputs:

outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]

Finally, use numpy.select:

res = np.select(conditions, outputs, 'Other')
pd.Series(res)

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object
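
To keep the result as a column in the original frame, the array can simply be assigned back:

df['race_label'] = res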

Why should numpy.select be used over apply? Here are some performance checks:

df = pd.concat([df]*1000)

In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %%timeit
    ...: conditions = [
    ...:     df['eri_hispanic'] == 1,
    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    ...:     df['eri_nat_amer'] == 1,
    ...:     df['eri_asian'] == 1,
    ...:     df['eri_afr_amer'] == 1,
    ...:     df['eri_hawaiian'] == 1,
    ...:     df['eri_white'] == 1,
    ...: ]
    ...:
    ...: outputs = [
    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ...: ]
    ...:
    ...: np.select(conditions, outputs, 'Other')
    ...:
    ...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using numpy.select gives us vastly improved performance, and the discrepancy will only grow as the data grows.

OK, two steps to this - the first is to write a function that does the translation you want - I've put an example together based on your pseudo-code:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

You may want to go over this, but it seems to do the trick - notice that the parameter going into the function is considered to be a Series object labelled "row".

Next, use the apply function in pandas to apply the function:

df.apply (lambda row: label_race(row), axis=1)

Note the axis=1 specifier, which means that the application is done at the row level rather than the column level. The results are here:

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White

If you're happy with those results, run it again, saving the results into a new column in your original dataframe.

df['race_label'] = df.apply (lambda row: label_race(row), axis=1)

The resulting dataframe looks like this (scroll to the right to see the new column):

      lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label
0      MOST    JEFF      E             0          0             0              0             0          1       White         White
1    CRUISE     TOM      E             0          0             0              1             0          0       White      Hispanic
2      DEPP  JOHNNY    NaN             0          0             0              0             0          1     Unknown         White
3     DICAP     LEO    NaN             0          0             0              0             0          1     Unknown         White
4    BRANDO  MARLON      E             0          0             0              0             0          0       White         Other
5     HANKS     TOM    NaN             0          0             0              0             0          1     Unknown         White
6    DENIRO  ROBERT      E             0          1             0              0             0          1       White   Two Or More
7    PACINO      AL      E             0          0             0              0             0          1       White         White
8  WILLIAMS   ROBIN      E             0          0             1              0             0          0       White  Haw/Pac Isl.
9  EASTWOOD   CLINT      E             0          0             0              0             0          1       White         White

There is yet another (easily generalizable) approach, whose basis is pandas.DataFrame.idxmax. First, the easily generalizable preamble.

# Indeed, all your conditions boil down to the following
_gt_1_key = 'two_or_more'
_lt_1_key = 'other'

# The "dictionary-based" if-else statements
labels = {
    _gt_1_key     : 'Two Or More',
    'eri_hispanic': 'Hispanic',
    'eri_nat_amer': 'A/I AK Native',
    'eri_asian'   : 'Asian',
    'eri_afr_amer': 'Black/AA',
    'eri_hawaiian': 'Haw/Pac Isl.',
    'eri_white'   : 'White',  
    _lt_1_key     : 'Other',
}

# The output-driving 1-0 matrix
mat = df.filter(regex='^eri_').copy()  # `~.copy` to avoid `SettingWithCopyWarning`

... and, finally, in a vectorized fashion:

mat[_gt_1_key] = gt1 = mat.sum(axis=1)
mat[_lt_1_key] = gt1.eq(0).astype(int)
race_label     = mat.idxmax(axis=1).map(labels)

where

>>> race_label
0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

That is a pandas.Series instance, which you can easily host in the df, i.e. df['race_label'] = race_label.

.apply() takes a function as its first parameter; pass in the label_race function as follows:

df['race_label'] = df.apply(label_race, axis=1)

You don't need to make a lambda function to pass in a function.

If we inspect its source code, apply() is syntactic sugar for a Python for-loop (via the apply_series_generator() method of the FrameApply class). Because it carries pandas overhead, it is generally slower than a plain Python loop.
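
Conceptually (this is a rough equivalent for intuition, not the actual implementation; f stands for any row-wise function), df.apply(f, axis=1) behaves much like an explicit Python loop over the rows:

result = pd.Series([f(row) for _, row in df.iterrows()], index=df.index)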

Use optimized (vectorized) methods wherever possible. If you have to use a loop, use the @numba.jit decorator.

1. Don't use apply() for an if-else ladder

df.apply() is just about the slowest method in pandas. As shown in the answers of user3483203 and Mohamed Thasin ah, depending on the dataframe size, np.select() and df.loc may be 50-300 times faster than df.apply() at producing the same output.

As it happens, a loop implementation (akin to apply()) using the @jit decorator of the numba module is (about 50-60%) faster than both df.loc and np.select.1

Numba works on numpy arrays, so before using the jit decorator, the dataframe needs to be converted into a numpy array. Values are then filled into a pre-initialized empty array by checking the conditions in a loop. Since numpy arrays don't have column names, the columns have to be accessed by index inside the loop. The most inconvenient part of the if-else ladder in the jitted function, compared to the one in apply(), is accessing the columns by their indices. Otherwise it is almost the same implementation.

import numpy as np
import numba as nb
@nb.jit(nopython=True)
def conditional_assignment(arr, res):    
    length = len(arr)
    for i in range(length):
        if arr[i][3] == 1:
            res[i] = 'Hispanic'
        elif arr[i][0] + arr[i][1] + arr[i][2] + arr[i][4] + arr[i][5] > 1:
            res[i] = 'Two Or More'
        elif arr[i][0]  == 1:
            res[i] = 'Black/AA'
        elif arr[i][1] == 1:
            res[i] = 'Asian'
        elif arr[i][2] == 1:
            res[i] = 'Haw/Pac Isl.'
        elif arr[i][4] == 1:
            res[i] = 'A/I AK Native'
        elif arr[i][5] == 1:
            res[i] = 'White'
        else:
            res[i] = 'Other'
    return res

# the columns with the boolean data
cols = [c for c in df.columns if c.startswith('eri_')]
# initialize an empty array to be filled in a loop
# for string dtype arrays, we need to know the length of the longest string
# and use it to set the dtype
res = np.empty(len(df), dtype=f"<U{len('A/I AK Native')}")
# pass the underlying numpy array of `df[cols]` into the jitted function
df['rno_defined'] = conditional_assignment(df[cols].values, res)

2. Don't use apply() for numeric operations

If you need to create a new column by adding two columns together, your first instinct may be to write

df['c'] = df.apply(lambda row: row['a'] + row['b'], axis=1)

But instead of this, add them row-wise using the sum(axis=1) method (or the + operator if there are only a couple of columns):

df['c'] = df[['a','b']].sum(axis=1)
# equivalently
df['c'] = df['a'] + df['b']

Depending on the dataframe size, sum(1) may be hundreds of times faster than apply().

In fact, you will almost never need apply() for numeric operations on a pandas dataframe, because it already has optimized methods for most operations: addition (sum(1)), subtraction (sub() or diff()), multiplication (prod(1)), division (div() or /), power (pow()), >, >=, ==, %, //, &, |, etc. can all be performed on the entire dataframe without apply().
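
A quick sketch of a few of these column-wise methods on a made-up toy frame (the columns 'a' and 'b' are purely for illustration):

import pandas as pd

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})
toy['diff'] = toy['a'].sub(toy['b'])    # subtraction, same as toy['a'] - toy['b']
toy['ratio'] = toy['a'].div(toy['b'])   # division, same as toy['a'] / toy['b']
toy['square'] = toy['a'].pow(2)         # power, same as toy['a'] ** 2
toy['greater'] = toy['a'].gt(toy['b'])  # comparison, same as toy['a'] > toy['b']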

For example, suppose you want to create a new column using the following rule:

IF [colC] > 0 THEN RETURN [colA] * [colB]
ELSE RETURN [colA] / [colB]

Using the optimized pandas methods, this can be written as

df['new'] = df[['colA','colB']].prod(1).where(df['colC']>0, df['colA'] / df['colB'])

The equivalent apply() solution is:

df['new'] = df.apply(lambda row: row.colA * row.colB if row.colC > 0 else row.colA / row.colB, axis=1)

For a dataframe with 20k rows, the approach using the optimized methods is 250 times faster than the equivalent apply() approach. The gap only widens as the data grows (for a dataframe with 1 mil rows it is 365 times faster), and the time difference will become more and more apparent.2



1: In the below result, I show the performance of the three approaches using a dataframe with 24 mil rows (this is the largest frame I can construct on my machine). For smaller frames, the numba-jitted function consistently runs at least 50% faster than the other two as well (you can check yourself).

def pd_loc(df):
    df['rno_defined'] = 'Other'
    df.loc[df['eri_nat_amer'] == 1, 'rno_defined'] = 'A/I AK Native'
    df.loc[df['eri_asian'] == 1, 'rno_defined'] = 'Asian'
    df.loc[df['eri_afr_amer'] == 1, 'rno_defined'] = 'Black/AA'
    df.loc[df['eri_hawaiian'] == 1, 'rno_defined'] = 'Haw/Pac Isl.'
    df.loc[df['eri_white'] == 1, 'rno_defined'] = 'White'
    df.loc[df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1) > 1, 'rno_defined'] = 'Two Or More'
    df.loc[df['eri_hispanic'] == 1, 'rno_defined'] = 'Hispanic'
    return df

def np_select(df):
    conditions = [df['eri_hispanic'] == 1,
                  df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
                  df['eri_nat_amer'] == 1,
                  df['eri_asian'] == 1,
                  df['eri_afr_amer'] == 1,
                  df['eri_hawaiian'] == 1,
                  df['eri_white'] == 1]
    outputs = ['Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White']
    df['rno_defined'] = np.select(conditions, outputs, 'Other')
    return df


@nb.jit(nopython=True)
def conditional_assignment(arr, res):
    
    length = len(arr)
    for i in range(length):
        if arr[i][3] == 1 :
            res[i] = 'Hispanic'
        elif arr[i][0] + arr[i][1] + arr[i][2] + arr[i][4] + arr[i][5] > 1 :
            res[i] = 'Two Or More'
        elif arr[i][0]  == 1:
            res[i] = 'Black/AA'
        elif arr[i][1] == 1:
            res[i] = 'Asian'
        elif arr[i][2] == 1:
            res[i] = 'Haw/Pac Isl.'
        elif arr[i][4] == 1 :
            res[i] = 'A/I AK Native'
        elif arr[i][5] == 1:
            res[i] = 'White'
        else:
            res[i] = 'Other'
            
    return res

def nb_loop(df):
    cols = [c for c in df.columns if c.startswith('eri_')]
    res = np.empty(len(df), dtype=f"<U{len('A/I AK Native')}")
    df['rno_defined'] = conditional_assignment(df[cols].values, res)
    return df

# df with 24mil rows
n = 4_000_000
df = pd.DataFrame({
    'eri_afr_amer': [0, 0, 0, 0, 0, 0]*n, 
    'eri_asian': [1, 0, 0, 0, 0, 0]*n, 
    'eri_hawaiian': [0, 0, 0, 1, 0, 0]*n, 
    'eri_hispanic': [0, 1, 0, 0, 1, 0]*n, 
    'eri_nat_amer': [0, 0, 0, 0, 1, 0]*n, 
    'eri_white': [0, 0, 1, 1, 0, 0]*n
}, dtype='int8')
df.insert(0, 'name', ['MOST', 'CRUISE', 'DEPP', 'DICAP', 'BRANDO', 'HANKS']*n)

%timeit nb_loop(df)
# 5.23 s ± 45.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

%timeit pd_loc(df)
# 7.97 s ± 28.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

%timeit np_select(df)
# 8.5 s ± 39.6 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

2: In the below result, I show the performance of the two approaches using dataframes with 20k rows and 1 mil rows, respectively. For smaller frames, the gap is smaller because the optimized approach has an overhead while apply() is simply a loop. As the frame size increases, the vectorization overhead shrinks relative to the overall runtime of the code, whereas apply() remains a loop over the frame.

n = 20_000 # 1_000_000
df = pd.DataFrame(np.random.rand(n,3)-0.5, columns=['colA','colB','colC'])

%timeit df[['colA','colB']].prod(1).where(df['colC']>0, df['colA'] / df['colB'])
# n = 20000: 2.69 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# n = 1000000: 86.2 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.apply(lambda row: row.colA * row.colB if row.colC > 0 else row.colA / row.colB, axis=1)
# n = 20000: 679 ms ± 33.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# n = 1000000: 31.5 s ± 587 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)