How do I select rows from a DataFrame based on the values in a column in Pandas?
In SQL, I would use:
SELECT *
FROM table
WHERE column_name = some_value
Current answer
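To add: you can also use df.groupby('column_name').get_group('column_desired_value').reset_index() to produce a new DataFrame containing only the rows where the given column has a particular value. For example: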
import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
print(df)
# NOTE: reset_index(drop=True) renumbers the rows from 0 and discards the old index,
# so no extra 'index' column has to be dropped afterwards
b_is_two_dataframe = df.groupby('B').get_group('two').reset_index(drop=True)
print('Sub dataframe where B is two:')
print(b_is_two_dataframe)
Running this gives:
Original dataframe:
     A      B
0  foo    one
1  bar    one
2  foo    two
3  bar  three
4  foo    two
5  bar    two
6  foo    one
7  foo  three
Sub dataframe where B is two:
     A    B
0  foo  two
1  foo  two
2  bar  two
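For comparison, here is a minimal sketch of the same selection with a boolean mask, which skips the groupby step entirely. It reuses the example data above; the names `mask`, `b_is_two` and `via_groupby` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split()})

# Boolean mask: True for every row where column B equals 'two'
mask = df['B'] == 'two'

# Keep the matching rows and renumber the index from 0
b_is_two = df[mask].reset_index(drop=True)

# The same rows obtained through groupby/get_group, for comparison
via_groupby = df.groupby('B').get_group('two').reset_index(drop=True)

print(b_is_two)
print(via_groupby)
```

The mask form is usually the more direct choice when only one value is needed; get_group is more natural when the frame is being grouped anyway.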
Other answers
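To select only specific columns, out of many, for rows where a column has a given value in Pandas, i.e. the equivalent of:
select col_name1, col_name2 from table where column_name = some_value
Option with loc:
df.loc[df['column_name'] == some_value, ['col_name1', 'col_name2']]
Or with query:
df.query('column_name == @some_value')[['col_name1', 'col_name2']]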
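A small runnable sketch of both forms on a toy frame. The column names `city`, `name`, `age` and the variable `some_value` are made up for illustration. Inside `query`, a Python variable is referenced with an `@` prefix.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Oslo', 'Rome', 'Oslo', 'Lima'],
    'name': ['Ann', 'Bob', 'Cy', 'Dee'],
    'age': [31, 45, 27, 52],
})
some_value = 'Oslo'

# .loc takes a row mask and a list of column labels in one step
via_loc = df.loc[df['city'] == some_value, ['name', 'age']]

# .query filters the rows first; the column list is applied afterwards
via_query = df.query('city == @some_value')[['name', 'age']]

print(via_loc)
print(via_loc.equals(via_query))
```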
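Great answers. However, once the DataFrame approaches a million rows, many of these methods tend to take a long time when using df[df['col']==val]. I wanted all possible values of "another_column" that correspond to each specific value in "some_column" (collected in a dictionary in this case). This worked and was fast.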
import datetime

s = datetime.datetime.now()
my_dict = {}
for i, my_key in enumerate(df['some_column'].values):
    if i % 100 == 0:
        print(i)  # to see the progress
    if my_key not in my_dict:
        # first time this key is seen: start a new list of values
        my_dict[my_key] = {}
        my_dict[my_key]['values'] = [df.iloc[i]['another_column']]
    else:
        my_dict[my_key]['values'].append(df.iloc[i]['another_column'])
e = datetime.datetime.now()
print('operation took ' + str(e - s))
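The loop above calls `df.iloc[i]` once per row, which is itself slow on large frames. As a possible alternative (a sketch, not the original author's method), grouping once and collecting each group's values into a list builds a similar mapping. The toy data is made up, and the result maps each key directly to a list rather than to a nested `{'values': [...]}` dictionary.

```python
import pandas as pd

df = pd.DataFrame({
    'some_column': ['a', 'b', 'a', 'c', 'b', 'a'],
    'another_column': [1, 2, 3, 4, 5, 6],
})

# Group once, collect each group's values into a Python list,
# then convert the resulting Series to a plain dict
my_dict = df.groupby('some_column')['another_column'].agg(list).to_dict()

print(my_dict['a'])
```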
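Using SQL statements on DataFrames to select rows with DuckDB
With DuckDB, we can query pandas DataFrames with SQL statements in a highly performant way.
Since the question is how to select rows from a DataFrame based on column values, and the example in the question is an SQL query, this answer seems to fit the topic.
Example: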
In [1]: import duckdb
In [2]: import pandas as pd
In [3]: con = duckdb.connect()
In [4]: df = pd.DataFrame({"A": range(11), "B": range(11, 22)})
In [5]: df
Out[5]:
A B
0 0 11
1 1 12
2 2 13
3 3 14
4 4 15
5 5 16
6 6 17
7 7 18
8 8 19
9 9 20
10 10 21
In [6]: results = con.execute("SELECT * FROM df where A > 2").df()
In [7]: results
Out[7]:
A B
0 3 14
1 4 15
2 5 16
3 6 17
4 7 18
5 8 19
6 9 20
7 10 21
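For comparison, the same filter can be written in plain pandas without DuckDB. A minimal sketch using the same data; the `reset_index(drop=True)` call is only there so the row labels match the DuckDB result above.

```python
import pandas as pd

df = pd.DataFrame({"A": range(11), "B": range(11, 22)})

# Boolean indexing: keep rows where A is greater than 2
results = df[df["A"] > 2].reset_index(drop=True)

print(results)
```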
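If you want to query your DataFrame repeatedly and speed matters to you, the best approach is to convert the DataFrame to a dictionary; doing so can make lookups thousands of times faster.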
my_df = df.set_index(column_name)
my_dict = my_df.to_dict('index')
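After building the my_dict dictionary, you can look values up like this:
if some_value in my_dict:
    my_result = my_dict[some_value]
If column_name contains duplicate values, the dictionary cannot be built. But you can still use: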
my_result = my_df.loc[some_value]
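A minimal sketch of both cases with made-up data (the column names `key` and `val` are illustrative). With unique keys, the dictionary lookup returns one row as a dict. With duplicate keys, `to_dict('index')` raises a `ValueError`, while `.loc` on the indexed frame returns every matching row.

```python
import pandas as pd

# Unique keys: dictionary lookup works
df = pd.DataFrame({'key': ['x', 'y', 'z'], 'val': [10, 20, 30]})
my_dict = df.set_index('key').to_dict('index')
my_result = my_dict.get('y')  # returns None if the key is absent
print(my_result)

# Duplicate keys: fall back to label-based lookup on the index
dup = pd.DataFrame({'key': ['x', 'y', 'x'], 'val': [10, 20, 30]})
dup_indexed = dup.set_index('key')
dup_result = dup_indexed.loc['x']  # DataFrame holding both 'x' rows
print(dup_result)
```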