How can I use the apply() function for a single column?

Given the following dataframe df and the function complex_function,

import pandas as pd

def complex_function(x, y=0):
    if x > 5 and x > y:
        return 1
    else:
        return 2

df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})

   col1  col2
0     1     6
1     4     7
2     6     1
3     2     2
4     7     8

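there are several ways to use apply() on only one column. I explain them in detail below.

1. The simple solution

The straightforward solution is the one from @Fabio Lamanna: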

df['col1'] = df['col1'].apply(complex_function)

Output:

   col1  col2
0     2     6
1     2     7
2     1     1
3     2     2
4     1     8

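Only the first column is modified; the second column is unchanged. The solution is beautiful. It is just one line of code and it reads almost like English: "Take 'col1' and apply the function complex_function to it."

However, if you need data from another column, e.g. 'col2', it does not work. If you want to pass the values of 'col2' to the parameter y of complex_function, you need something else.

2. Solutions that use the entire dataframe

Alternatively, you could use the entire dataframe, as described in this SO post or this one: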

df['col1'] = df.apply(lambda x: complex_function(x['col1']), axis=1)

or, if you prefer (like me) a solution without a lambda function:

def apply_complex_function(x):
    return complex_function(x['col1'])
df['col1'] = df.apply(apply_complex_function, axis=1)

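There is a lot going on in this solution that needs to be explained. The apply() function works on both pd.Series and pd.DataFrame. But you cannot use df['col1'] = df.apply(complex_function).loc[:, 'col1'], because it would throw a ValueError.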
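
To see where that ValueError comes from, here is a minimal sketch that redefines df and complex_function from above so it runs on its own. With the default axis=0, each column arrives in complex_function as a whole pd.Series, so x > 5 produces a Series of booleans that the if statement cannot reduce to a single True or False. The variable error_message is only introduced for this illustration.

```python
import pandas as pd

def complex_function(x, y=0):
    if x > 5 and x > y:
        return 1
    else:
        return 2

df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})

try:
    # Each column is passed as a whole Series, not value by value.
    df.apply(complex_function)
    error_message = None
except ValueError as exc:
    # pandas refuses to treat a boolean Series as a single truth value.
    error_message = str(exc)

print(error_message)
```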

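Hence, you need to say which column to use. To complicate things, apply() only accepts callables. To solve this, you define a (lambda) function that takes the row and passes x['col1'] on as the argument; in other words, the column information is wrapped in another function.

Unfortunately, the default value of the axis parameter is 0 (axis=0), which means apply() will try to execute column-wise rather than row-wise. This was not a problem in the first solution, because there we gave apply() a pd.Series. But now the input is a dataframe and we must be explicit (axis=1). (I am surprised how often I forget this.)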
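
The difference between the two axis values can be sketched on the same dataframe. The names per_column and per_row are only illustrative: with axis=0 the function receives one column at a time, with axis=1 it receives one row at a time and can look up values by column name.

```python
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})

# axis=0 (the default): the function is called once per column,
# and the result is indexed by column name.
per_column = df.apply(lambda column: column.max())

# axis=1: the function is called once per row,
# and the result is indexed like the dataframe.
per_row = df.apply(lambda row: row['col1'] + row['col2'], axis=1)

print(per_column)
print(per_row)
```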

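Whether you prefer the version with a named function is subjective. In my opinion, the line of code is complicated enough to read even without a lambda function. You only need the (lambda) function as a wrapper. It is just boilerplate code, and a reader should not be bothered with it.

Now you can easily modify this solution to take the second column into account: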

def apply_complex_function(x):
    return complex_function(x['col1'], x['col2'])
df['col1'] = df.apply(apply_complex_function, axis=1)

Output:

   col1  col2
0     2     6
1     2     7
2     1     1
3     2     2
4     2     8

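At index 4 the value changed from 1 to 2, because the first condition 7 > 5 is true but the second condition 7 > 8 is false.

Note that you only needed to change the wrapper function, not the line that calls apply().

Side note

Never put the column information into your function.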

def bad_idea(x):
    return x['col1'] ** 2

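By doing this, you make a general function dependent on a column name! This is a bad idea, because the next time you want to use this function, you cannot. Worse: maybe you rename a column in a different dataframe just to make it work with your existing function. (Been there, done that. It is a slippery slope!)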
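
A minimal sketch of the alternative, with the column choice kept outside the function. The names square, other, and 'price' are hypothetical and only serve the illustration: the function works on plain values, so it can be reused on any column of any dataframe.

```python
import pandas as pd

def square(value):
    # Knows nothing about dataframes or column names.
    return value ** 2

df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})
other = pd.DataFrame(data={'price': [3, 5]})  # hypothetical second dataframe

# The column is selected at the call site, not inside the function.
squared_col1 = df['col1'].apply(square)
squared_price = other['price'].apply(square)

print(squared_col1.tolist())
print(squared_price.tolist())
```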

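3. Alternative solutions without apply()

Although the OP specifically asked for a solution with apply(), alternative solutions were suggested. For example, the answer of @George Petrov suggested map(); the answer of @Thibaut Dubernet proposed assign().

I fully agree that apply() is seldom the best solution, because apply() is not vectorized. It is an element-wise operation with expensive function calls and the overhead of pd.Series.

One reason to use apply() is that you want to use an existing function and performance is not an issue. Or your function is so complex that no vectorized version exists.

Another reason to use apply() is in combination with groupby(). Please note that DataFrame.apply() and GroupBy.apply() are different functions.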
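
As a rough illustration of that difference, the sketch below applies the same function once through DataFrame.apply() and once through a groupby. The dataframe sales and its columns 'group' and 'value' are made up for this example. DataFrame.apply() hands the function a whole column, while the grouped apply() hands it one sub-Series per group.

```python
import pandas as pd

# Hypothetical data, only for illustration.
sales = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                      'value': [1, 4, 10, 30]})

def value_range(series):
    return series.max() - series.min()

# DataFrame.apply(): called once for the whole 'value' column.
overall = sales[['value']].apply(value_range)

# Grouped apply(): called once per group, result indexed by group key.
per_group = sales.groupby('group')['value'].apply(value_range)

print(overall)
print(per_group)
```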

So it does make sense to consider some alternatives:

map() only works on pd.Series, but accepts dict and pd.Series as input. Using map() with a function is almost interchangeable with using apply(). It can be faster than apply(). See this SO post for more details.

df['col1'] = df['col1'].map(complex_function)

applymap() is almost identical for dataframes. It does not support pd.Series and it will always return a dataframe. However, it can be faster. The documentation states: "In the current implementation applymap calls func twice on the first column/row to decide whether it can take a fast or slow code path." But if performance really counts you should seek an alternative route.

df['col1'] = df.applymap(complex_function).loc[:, 'col1']

assign() is not a feasible replacement for apply(). It has a similar behaviour in only the most basic use cases. It does not work with complex_function on its own; you still need apply(), as the example below shows. The main use case for assign() is method chaining, because it gives back a new dataframe without changing the original one, so the result is bound to df rather than to a single column.

df = df.assign(col1=df.col1.apply(complex_function))
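
The dict input that map() accepts can be sketched as follows. The lookup table labels is a hypothetical example, not part of the original answer. Values without a matching key become NaN, which is one practical difference from passing a function.

```python
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})

# Hypothetical lookup table covering every value in col1.
labels = {1: 'low', 4: 'low', 6: 'high', 2: 'low', 7: 'high'}
mapped = df['col1'].map(labels)

# A dict that covers only some values leaves the rest as NaN.
partial = df['col1'].map({6: 'six'})

print(mapped.tolist())
print(partial.isna().sum())
```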

Appendix: How to speed up apply()?

I only mention it here because it was suggested by other answers, e.g. @durjoy. The list is not exhaustive:

Do not use apply(). This is no joke. For most numeric operations, a vectorized method exists in pandas. If/else blocks can often be refactored with a combination of boolean indexing and .loc. My example complex_function could be refactored in this way.

Refactor to Cython. If you have a complex equation and the parameters of the equation are in your dataframe, this might be a good idea. Check out the official pandas user guide for more information.

Use the raw=True parameter. Theoretically, this should improve the performance of apply() if you are just applying a NumPy reduction function, because the overhead of pd.Series is removed. Of course, your function has to accept an ndarray. You have to refactor your function to NumPy. By doing this, you will have a huge performance boost.

Use 3rd party packages. The first thing you should try is Numba. I do not know swifter, mentioned by @durjoy; and probably many other packages are worth mentioning here.

Try/Fail/Repeat. As mentioned above, map() and applymap() can be faster, depending on the use case. Just time the different versions and choose the fastest. This approach is the most tedious one with the least performance increase.
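
One possible way to refactor complex_function along the lines of the first point is sketched below. The boolean-indexing variant with .loc follows the text; the np.where variant is an alternative swapped in here for comparison and is not taken from the original answer. The column name 'new' and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def complex_function(x, y=0):
    if x > 5 and x > y:
        return 1
    else:
        return 2

df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})

# Reference result from the row-wise apply() version.
expected = df.apply(lambda row: complex_function(row['col1'], row['col2']), axis=1)

# The two conditions, evaluated for all rows at once.
mask = (df['col1'] > 5) & (df['col1'] > df['col2'])

# Variant 1: boolean indexing with .loc on a copy of the dataframe.
result = df.copy()
result['new'] = 2            # the "else" branch as default
result.loc[mask, 'new'] = 1  # the "if" branch where the mask is True

# Variant 2 (swapped-in alternative): np.where picks 1 or 2 per row.
vectorized = np.where(mask, 1, 2)

print(result['new'].tolist())
print(vectorized.tolist())
```

Both variants evaluate the comparison for the whole column in one step instead of calling a Python function per row.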

2020-07-18 12:01:22

Given a sample dataframe df as:

what you want is:

df['a'] = df['a'].apply(lambda x: x + 1)

that returns:

2016-01-23 10:15:49

For a single column it is better to use map(), like this:

df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])

    a   b  c
0  15  15  5
1  20  10  7
2  25  30  9



df['a'] = df['a'].map(lambda a: a / 2.)

      a   b  c
0   7.5  15  5
1  10.0  10  7
2  12.5  30  9

2016-01-23 10:49:31

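Let me try a more complex computation using datetime that also takes nulls or empty strings into account. I subtract 30 years from a datetime column, using the apply method together with a lambda, and convert the datetime format. The clause if x != '' else x makes sure that empty strings and nulls are passed through unchanged.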

import datetime

# Replace nulls with empty strings so the lambda can skip them.
df['Date'] = df['Date'].fillna('')
df['Date'] = df['Date'].apply(lambda x : ((datetime.datetime.strptime(str(x), '%m/%d/%Y') - datetime.timedelta(days=30*365)).strftime('%Y%m%d')) if x != '' else x)
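
A self-contained sketch of the same approach, assuming the 'Date' column holds strings in %m/%d/%Y format. The sample values and the helper name shift_back are hypothetical. Note that 30*365 days ignores leap days, so the result lands a few days after the same calendar date 30 years earlier.

```python
import datetime
import pandas as pd

# Hypothetical sample data with one missing value.
df = pd.DataFrame({'Date': ['01/15/2020', None, '12/31/1999']})

def shift_back(value):
    # Empty strings (former nulls) are passed through unchanged.
    if value == '':
        return value
    parsed = datetime.datetime.strptime(str(value), '%m/%d/%Y')
    shifted = parsed - datetime.timedelta(days=30 * 365)
    return shifted.strftime('%Y%m%d')

df['Date'] = df['Date'].fillna('').apply(shift_back)

print(df['Date'].tolist())
```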

2020-02-14 15:12:37

You don't need a function at all. You can work on a whole column directly.

Example data:

>>> df = pd.DataFrame({'a': [100, 1000], 'b': [200, 2000], 'c': [300, 3000]})
>>> df

      a     b     c
0   100   200   300
1  1000  2000  3000

Half of all the values in column a:

>>> df.a = df.a / 2
>>> df

     a     b     c
0   50   200   300
1  500  2000  3000

2016-01-23 10:58:30
