如何根据Pandas中某列的值从DataFrame中选择行?
在SQL中,我会使用:
SELECT *
FROM table
WHERE column_name = some_value
如何根据Pandas中某列的值从DataFrame中选择行?
在SQL中,我会使用:
SELECT *
FROM table
WHERE column_name = some_value
当前回答
tl;博士
熊猫相当于
select * from table where column_name = some_value
is
table[table.column_name == some_value]
多种条件:
table[(table.column_name == some_value) | (table.column_name2 == some_value2)]
or
table.query('column_name == some_value | column_name2 == some_value2')
代码示例
import pandas as pd
# Create data set
d = {'foo':[100, 111, 222],
'bar':[333, 444, 555]}
df = pd.DataFrame(d)
# Full dataframe:
df
# Shows:
# bar foo
# 0 333 100
# 1 444 111
# 2 555 222
# Output only the row(s) in df where foo is 222:
df[df.foo == 222]
# Shows:
# bar foo
# 2 555 222
在上面的代码中,是df[df.foo==222]行根据列值给出行,在本例中为222。
也可能出现多种情况:
df[(df.foo == 222) | (df.bar == 444)]
# bar foo
# 1 444 111
# 2 555 222
但在这一点上,我建议使用查询函数,因为它不那么冗长,并产生相同的结果:
df.query('foo == 222 | bar == 444')
其他回答
tl;博士
熊猫相当于
select * from table where column_name = some_value
is
table[table.column_name == some_value]
多种条件:
table[(table.column_name == some_value) | (table.column_name2 == some_value2)]
or
table.query('column_name == some_value | column_name2 == some_value2')
代码示例
import pandas as pd
# Create data set
d = {'foo':[100, 111, 222],
'bar':[333, 444, 555]}
df = pd.DataFrame(d)
# Full dataframe:
df
# Shows:
# bar foo
# 0 333 100
# 1 444 111
# 2 555 222
# Output only the row(s) in df where foo is 222:
df[df.foo == 222]
# Shows:
# bar foo
# 2 555 222
在上面的代码中,是df[df.foo==222]行根据列值给出行,在本例中为222。
也可能出现多种情况:
df[(df.foo == 222) | (df.bar == 444)]
# bar foo
# 1 444 111
# 2 555 222
但在这一点上,我建议使用查询函数,因为它不那么冗长,并产生相同的结果:
df.query('foo == 222 | bar == 444')
使用numpy.where可以获得更快的结果。
例如,使用unubtu的设置-
In [76]: df.iloc[np.where(df.A.values=='foo')]
Out[76]:
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
时间比较:
In [68]: %timeit df.iloc[np.where(df.A.values=='foo')] # fastest
1000 loops, best of 3: 380 µs per loop
In [69]: %timeit df.loc[df['A'] == 'foo']
1000 loops, best of 3: 745 µs per loop
In [71]: %timeit df.loc[df['A'].isin(['foo'])]
1000 loops, best of 3: 562 µs per loop
In [72]: %timeit df[df.A=='foo']
1000 loops, best of 3: 796 µs per loop
In [74]: %timeit df.query('(A=="foo")') # slowest
1000 loops, best of 3: 1.71 ms per loop
您可以在函数中使用loc(方括号):
# Series
s = pd.Series([1, 2, 3, 4])
s.loc[lambda x: x > 1]
# s[lambda x: x > 1]
输出:
1 2
2 3
3 4
dtype: int64
or
# DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
df.loc[lambda x: x['A'] > 1]
# df[lambda x: x['A'] > 1]
输出:
A B
1 2 20
2 3 30
对于Pandas中给定值的多个列中仅选择特定列:
select col_name1, col_name2 from table where column_name = some_value.
选项位置:
df.loc[df['column_name'] == some_value, [col_name1, col_name2]]
或查询:
df.query('column_name == some_value')[[col_name1, col_name2]]
如果您想重复查询数据帧,并且速度对您很重要,最好的方法是将数据帧转换为字典,然后通过这样做,您可以将查询速度提高数千倍。
my_df = df.set_index(column_name)
my_dict = my_df.to_dict('index')
制作my_dict字典后,您可以浏览:
if some_value in my_dict.keys():
my_result = my_dict[some_value]
如果column_name中有重复值,则无法创建字典。但您可以使用:
my_result = my_df.loc[some_value]