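I have a scenario where a user wants to apply several filters to a Pandas DataFrame or Series object. Essentially, I want to efficiently chain together a series of filters (comparison operations) that the user specifies at run time.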

The filters should be additive (aka each one applied should narrow results). I'm currently using reindex() (as below) but this creates a new object each time and copies the underlying data (if I understand the documentation correctly). I want to avoid this unnecessary copying as it will be really inefficient when filtering a big Series or DataFrame. I'm thinking that using apply(), map(), or something similar might be better. I'm pretty new to Pandas though so still trying to wrap my head around everything. Also, I would like to expand this so that the dictionary passed in can include the columns to operate on and filter an entire DataFrame based on the input dictionary. However, I'm assuming whatever works for a Series can be easily expanded to a DataFrame.

TL;DR

I want to take a dictionary of the following form, apply each operation to a given Series object, and return a "filtered" Series object.

relops = {'>=': [1], '<=': [1]}

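Long Example

I'll start with an example of what I have currently, filtering just a single Series object. Below is the function I'm currently using: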

    def apply_relops(series, relops):
        """
        Pass dictionary of relational operators to perform on given series object
        """
        for op, vals in relops.iteritems():
            op_func = ops[op]
            for val in vals:
                filtered = op_func(series, val)
                series = series.reindex(series[filtered])
        return series

The user provides a dictionary with the operations they want to perform:

>>> df = pandas.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
>>> print df
   col1  col2
0     0    10
1     1    11
2     2    12

>>> from operator import le, ge
>>> ops = {'>=': ge, '<=': le}
>>> apply_relops(df['col1'], {'>=': [1]})
col1
1       1
2       2
Name: col1
>>> apply_relops(df['col1'], relops = {'>=': [1], '<=': [1]})
col1
1       1
Name: col1

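Again, the "problem" with my approach above is that I think there is a lot of possibly unnecessary copying of the data in the intermediate steps.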


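Current answer

Since the pandas 0.22 update, comparison methods are available, such as:

gt (greater than), lt (less than), eq (equal to), ne (not equal to), ge (greater than or equal to)

and many more. These methods return a boolean Series. Let's see how to use them: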

# sample data
df = pd.DataFrame({'col1': [0, 1, 2,3,4,5], 'col2': [10, 11, 12,13,14,15]})

# get values from col1 greater than or equal to 1
df.loc[df['col1'].ge(1),'col1']

1    1
2    2
3    3
4    4
5    5

# where col1 values are between 0 and 2 (inclusive)
df.loc[df['col1'].between(0,2)]

 col1 col2
0   0   10
1   1   11
2   2   12

# where col1 > 1
df.loc[df['col1'].gt(1)]

 col1 col2
2   2   12
3   3   13
4   4   14
5   5   15
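
The comparison methods above can also be looked up by name, which fits the question's run-time dictionary. The following is only a minimal sketch under some assumptions: the `OPS` mapping and this version of `apply_relops` are illustrative and not part of pandas. It combines all the boolean masks first and indexes the Series once at the end, so the data is not re-selected after every filter (each comparison still allocates a boolean mask).

```python
import pandas as pd

# Hypothetical mapping from the question's operator strings to Series method names.
OPS = {'>=': 'ge', '<=': 'le', '>': 'gt', '<': 'lt', '==': 'eq', '!=': 'ne'}


def apply_relops(series, relops):
    """Keep only the values of `series` that satisfy every comparison in `relops`."""
    # Start with an all-True mask so an empty dict returns every value.
    mask = pd.Series(True, index=series.index)
    for op, vals in relops.items():
        for val in vals:
            # Each comparison method returns a boolean Series; AND them together.
            mask &= getattr(series, OPS[op])(val)
    # Index the data once, after all filters have been combined.
    return series[mask]


df = pd.DataFrame({'col1': [0, 1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14, 15]})
result = apply_relops(df['col1'], {'>=': [1], '<=': [3]})
print(result.tolist())
```

The same mask could be used to select rows of the whole DataFrame with `df[mask]` if the function were changed to return the mask instead of the filtered Series.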

Other answers

Simplest of all solutions:

Use:

filtered_df = df[(df['col1'] >= 1) & (df['col1'] <= 5)]

Another example: to filter the DataFrame for rows that belong to February 2018, use the code below.

filtered_df = df[(df['year'] == 2018) & (df['month'] == 2)]

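We can also select rows based on column values that are not in a list or any other iterable. We create the boolean mask just as before, but now we negate it by placing ~ in front.

For example:

values = [1, 0]
df[~df.col1.isin(values)]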

If you want to check whether any or all of several columns match a value, you can do:

df[(df[['HomeTeam', 'AwayTeam']] == 'Fulham').any(axis=1)]
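
To make the any/all distinction concrete, here is a small sketch. The `matches` DataFrame and its team names are made up for illustration; only the `HomeTeam` and `AwayTeam` column names come from the line above.

```python
import pandas as pd

# Hypothetical match data, invented for this example.
matches = pd.DataFrame({
    'HomeTeam': ['Fulham', 'Arsenal', 'Chelsea'],
    'AwayTeam': ['Arsenal', 'Fulham', 'Everton'],
})

# Element-wise comparison gives a boolean DataFrame with the same two columns.
is_fulham = matches[['HomeTeam', 'AwayTeam']] == 'Fulham'

# any(axis=1): keep rows where at least one of the columns matches.
either = matches[is_fulham.any(axis=1)]

# all(axis=1): keep rows where every one of the columns matches.
both = matches[is_fulham.all(axis=1)]

print(either.index.tolist())
print(len(both))
```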

Why not do this?

def filt_spec(df, col, val, op):
    import operator
    ops = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt, 'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}
    return df[ops[op](df[col], val)]
pandas.DataFrame.filt_spec = filt_spec

Demo:

df = pd.DataFrame({'a': [1,2,3,4,5], 'b':[5,4,3,2,1]})
df.filt_spec('a', 2, 'ge')

Result:

   a  b
 1  2  4
 2  3  3
 3  4  2
 4  5  1

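You can see that column 'a' has been filtered to the rows where a >= 2.

This is slightly faster than operator chaining (in typing time, not performance). You could of course put the import at the top of the file.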
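
If attaching a method to `pandas.DataFrame` feels too invasive, one possible alternative is `DataFrame.pipe`, which passes the DataFrame as the first argument to any function. This is only a sketch of that variation, reusing the `filt_spec` idea from above with the operator table moved to module level (the name `OPS` is an assumption made for this example):

```python
import operator

import pandas as pd

# Operator table from the answer above, defined once at module level.
OPS = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt,
       'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}


def filt_spec(df, col, val, op):
    # Build a boolean mask for one column and select the matching rows.
    return df[OPS[op](df[col], val)]


df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [5, 4, 3, 2, 1]})

# Chain two filters without modifying the DataFrame class.
result = df.pipe(filt_spec, 'a', 2, 'ge').pipe(filt_spec, 'b', 2, 'ge')
print(result)
```

Each `pipe` call narrows the previous result, so the filters stay additive as the question requires.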
