我想将我的自定义函数(它使用if-else阶梯)应用到这六列(ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr。Amer, ERI_HI_PacIsl, ERI_White)在我的数据帧的每一行。

I've tried different methods from other questions but still can't seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic they can't be counted as anything else. Even if they have a "1" in another ethnicity column they still are counted as Hispanic not two or more races. Similarly, if the sum of all the ERI columns is greater than 1 they are counted as two or more races and can't be counted as a unique ethnicity(except for Hispanic).

这几乎就像对每一行进行for循环,如果每个记录满足一个条件,它们就被添加到一个列表中,并从原始列表中删除。

从下面的数据框架中,我需要根据SQL中的以下规范计算一个新列:

标准

IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”

备注:如果西班牙裔的ERI标志为真(1),则该员工被归类为“西班牙裔”

备注:如果多于1个非西班牙ERI Flag为真,返回" Two or more "

DATAFRAME

     lname          fname       rno_cd  eri_afr_amer    eri_asian   eri_hawaiian    eri_hispanic    eri_nat_amer    eri_white   rno_defined
0    MOST           JEFF        E       0               0           0               0               0               1           White
1    CRUISE         TOM         E       0               0           0               1               0               0           White
2    DEPP           JOHNNY              0               0           0               0               0               1           Unknown
3    DICAP          LEO                 0               0           0               0               0               1           Unknown
4    BRANDO         MARLON      E       0               0           0               0               0               0           White
5    HANKS          TOM         0                       0           0               0               0               1           Unknown
6    DENIRO         ROBERT      E       0               1           0               0               0               1           White
7    PACINO         AL          E       0               0           0               0               0               1           White
8    WILLIAMS       ROBIN       E       0               0           1               0               0               0           White
9    EASTWOOD       CLINT       E       0               0           0               0               0               1           White

当前回答

上面的答案是完全有效的,但是存在一个向量化的解决方案,形式为numpy.select。这允许你定义条件,然后定义这些条件的输出,比使用apply更有效:


首先,定义条件:

conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]

现在,定义相应的输出:

outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]

最后,使用numpy.select:

res = np.select(conditions, outputs, 'Other')
pd.Series(res)

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

为什么要numpy。选择使用而不是应用?下面是一些性能检查:

df = pd.concat([df]*1000)

In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %%timeit
    ...: conditions = [
    ...:     df['eri_hispanic'] == 1,
    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    ...:     df['eri_nat_amer'] == 1,
    ...:     df['eri_asian'] == 1,
    ...:     df['eri_afr_amer'] == 1,
    ...:     df['eri_hawaiian'] == 1,
    ...:     df['eri_white'] == 1,
    ...: ]
    ...:
    ...: outputs = [
    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ...: ]
    ...:
    ...: np.select(conditions, outputs, 'Other')
    ...:
    ...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

使用numpy。Select极大地提高了性能,并且随着数据的增长,差异只会增加。

其他回答

根据标准的复杂程度选择方法

对于下面的例子——为了为新列显示多种类型的规则——我们将假设DataFrame包含列“red”、“green”和“blue”,包含从0到1的浮点值。

一般情况:.apply

只要计算新值所需的逻辑可以写成同一行中其他值的函数,我们就可以使用DataFrame的.apply方法来获得所需的结果。编写函数,使其接受单个参数,即输入的单行:

def as_hex(value):
    # clamp to avoid rounding errors etc.
    return min(max(0, int(value * 256)), 255)

def hex_color(row):
    r, g, b = as_hex(row['red']), as_hex(row['green']), as_hex(row['blue'])
    return f'#{r:02x}{g:02x}{b:02x}'

将函数本身(不要在名称后面写圆括号)传递给.apply,并指定axis=1(意味着向分类函数提供行,以便计算列-而不是相反)。因此:

df['hex_color'] = df.apply(hex_color, axis=1)

注意,在lambda中包装是不必要的,因为我们没有绑定任何参数或以其他方式修改函数。

apply步骤是必要的,因为转换函数本身不是向量化的。因此,像df['color'] = hex_color(df)这样的简单方法将不起作用(示例问题)。

这个工具功能强大,但效率低下。为了获得最佳性能,请在适用的情况下使用更具体的方法。

带条件的多种选择:numpy。选择或重复赋值df。Loc或df.where

假设我们对颜色值进行阈值设置,并计算大致的颜色名称,如下所示:

def additive_color(row):
    # Insert here: logic that takes values from the `row` and computes
    # the desired cell value for the new column in that row.
    # The `row` is an ordinary `Series` object representing a row of the
    # original `DataFrame`; it can be indexed with column names, thus:
    if row['red'] > 0.5:
        if row['green'] > 0.5:
            return 'white' if row['blue'] > 0.5 else 'yellow'
        else:
            return 'magenta' if row['blue'] > 0.5 else 'red'
    elif row['green'] > 0.5:
        return 'cyan' if row['blue'] > 0.5 else 'green'
    else:
        return 'blue' if row['blue'] > 0.5 else 'black'

在这种情况下——分类函数是一个if/else阶梯,或者3.10及更高版本中的match/case——我们可以使用numpy.select获得更快的性能。

这种方法的工作原理非常不同。首先,计算每个条件适用的数据掩码:

black = (df['red'] <= 0.5) & (df['green'] <= 0.5) & (df['blue'] <= 0.5)
white = (df['red'] > 0.5) & (df['green'] > 0.5) & (df['blue'] > 0.5)

调用numpy。选择,我们需要两个并行序列——一个是条件,另一个是对应的值:

df['color'] = np.select(
    [white, black],
    ['white', 'black'],
    'colorful'
)

可选的第三个参数指定当所有条件都不满足时使用的值。(作为练习:填写剩下的条件,并尝试不带第三个参数。)

类似的方法是基于每个条件进行重复分配。先指定默认值,然后使用df。Loc为每个条件分配特定值:

df['color'] = 'colorful'
df.loc[white, 'color'] = 'white'
df.loc[black, 'color'] = 'black'

此外,df。在哪里可以用来做作业。然而,df。Where,像这样使用,在条件不满足的地方分配指定的值,因此条件必须颠倒:

df['color'] = 'colorful'
df['color'] = df['color'].where(~white, 'white').where(~black, 'black')

简单的数学操作:内置数学运算符和广播

例如,基于应用程序的方法如下:

def brightness(row):
    return row['red'] * .299 + row['green'] * .587 + row['blue'] * .114

df['brightness'] = df.apply(brightness, axis=1)

相反,可以通过广播操作符来编写,以获得更好的性能(也更简单):

df['brightness'] = df['red'] * .299 + df['green'] * .587 + df['blue'] * .114

作为练习,下面是第一个这样做的例子:

def as_hex(column):
    scaled = (column * 256).astype(int)
    clamped = scaled.where(scaled >= 0, 0).where(scaled <= 255, 255)
    return clamped.apply(lambda i: f'{i:02x}')

df['hex_color'] = '#' + as_hex(df['red']) + as_hex(df['green']) + as_hex(df['blue'])

我无法找到一个向量化的等效格式来将整数值格式化为十六进制字符串,因此.apply仍然在这里内部使用-这意味着全速惩罚仍然发挥作用。不过,这演示了一些通用技术。

更多细节和例子,请看cottontail的回答。

正如@user3483203所指出的,numpy。选择是最好的方法

将条件语句和相应的操作存储在两个列表中

conds = [(df['eri_hispanic'] == 1),(df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1)),(df['eri_nat_amer'] == 1),(df['eri_asian'] == 1),(df['eri_afr_amer'] == 1),(df['eri_hawaiian'] == 1),(df['eri_white'] == 1,])

actions = ['Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White']

你现在可以使用np。选择使用这些列表作为参数

df['label_race'] = np.select(conds,actions,default='Other')

参考:https://numpy.org/doc/stable/reference/generated/numpy.select.html

试试这个,

df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)

O/P:

     lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian  \
0      MOST    JEFF      E             0          0             0   
1    CRUISE     TOM      E             0          0             0   
2      DEPP  JOHNNY    NaN             0          0             0   
3     DICAP     LEO    NaN             0          0             0   
4    BRANDO  MARLON      E             0          0             0   
5     HANKS     TOM    NaN             0          0             0   
6    DENIRO  ROBERT      E             0          1             0   
7    PACINO      AL      E             0          0             0   
8  WILLIAMS   ROBIN      E             0          0             1   
9  EASTWOOD   CLINT      E             0          0             0   

   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label  
0             0             0          1       White         White  
1             1             0          0       White      Hispanic  
2             0             0          1     Unknown         White  
3             0             0          1     Unknown         White  
4             0             0          0       White         Other  
5             0             0          1     Unknown         White  
6             0             0          1       White   Two Or More  
7             0             0          1       White         White  
8             0             0          0       White  Haw/Pac Isl.  
9             0             0          1       White         White 

使用.loc代替apply。

它改进了向量化。

.loc的工作方式很简单,根据条件屏蔽行,对冻结行应用值。

欲了解更多细节,请访问。loc文档

性能指标:

答:接受

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)

%timeit df.apply(lambda row: label_race(row), axis=1)

每循环1.15 s±46.5 ms(平均±标准值7次运行,每循环1次)

我建议的答案:

def label_race(df):
    df.loc[df['eri_white']==1,'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
    df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)

%timeit label_race(df)

每循环24.7 ms±1.7 ms(平均±标准值7次运行,每循环10次)

上面的答案是完全有效的,但是存在一个向量化的解决方案,形式为numpy.select。这允许你定义条件,然后定义这些条件的输出,比使用apply更有效:


首先,定义条件:

conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]

现在,定义相应的输出:

outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]

最后,使用numpy.select:

res = np.select(conditions, outputs, 'Other')
pd.Series(res)

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

为什么要numpy。选择使用而不是应用?下面是一些性能检查:

df = pd.concat([df]*1000)

In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %%timeit
    ...: conditions = [
    ...:     df['eri_hispanic'] == 1,
    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    ...:     df['eri_nat_amer'] == 1,
    ...:     df['eri_asian'] == 1,
    ...:     df['eri_afr_amer'] == 1,
    ...:     df['eri_hawaiian'] == 1,
    ...:     df['eri_white'] == 1,
    ...: ]
    ...:
    ...: outputs = [
    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ...: ]
    ...:
    ...: np.select(conditions, outputs, 'Other')
    ...:
    ...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

使用numpy。Select极大地提高了性能,并且随着数据的增长,差异只会增加。

因为这是'pandas new column from others'的第一个谷歌结果,这里有一个简单的例子:

import pandas as pd

# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
#    a  b
# 0  1  3
# 1  2  4

# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0    4
# 1    6

# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
#    a  b  c
# 0  1  3  4
# 1  2  4  6

如果你得到SettingWithCopyWarning,你也可以这样做:

fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'

来源:https://stackoverflow.com/a/12555510/243392

如果你的列名包含空格,你可以使用这样的语法:

df = df.assign(**{'some column name': col.values})

这是apply和assign的文档。