Suppose we have a DataFrame in Python pandas that looks like this:
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': [u'aball', u'bball', u'cnut', u'fball']})
Or, in table form:
ids vals
aball 1
bball 2
cnut 3
fball 4
How do I filter the rows that contain the keyword "ball"? For example, the output should be:
ids vals
aball 1
bball 2
fball 4
df[df['ids'].str.contains('ball', na = False)] # valid for (at least) pandas version 0.17.1
Step-by-step explanation (from inner to outer):
df['ids'] selects the ids column of the data frame (technically, the object df['ids'] is of type pandas.Series)
df['ids'].str allows us to apply vectorized string methods (e.g., lower, contains) to the Series
df['ids'].str.contains('ball') checks each element of the Series as to whether the element value has the string 'ball' as a substring. The result is a Series of Booleans indicating True or False about the existence of a 'ball' substring.
df[df['ids'].str.contains('ball')] applies the Boolean 'mask' to the dataframe and returns a new DataFrame containing only the rows where the mask is True.
na = False makes NA / NaN values count as non-matches (False) in the mask; otherwise the mask itself contains NaN, and indexing with it may raise a ValueError.
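The steps above can be traced on a small frame. This is a minimal sketch, assuming one extra hypothetical row with a missing id (not part of the question's data) so that the effect of na=False is visible:

```python
import numpy as np
import pandas as pd

# Data from the question, plus one hypothetical row whose id is missing
df = pd.DataFrame({'vals': [1, 2, 3, 4, 5],
                   'ids': ['aball', 'bball', 'cnut', 'fball', np.nan]})

# Boolean mask: True where 'ball' is a substring; the missing id counts as False
mask = df['ids'].str.contains('ball', na=False)

# Keep only the rows where the mask is True
result = df[mask]
print(result)
```

The mask is kept in its own variable here only to make the intermediate step easy to inspect; the one-liner does the same thing.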
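If you want to set the column you filter on as the new index, you could also consider using .filter; if you want to keep it as a separate column, then str.contains is the way to go.
Say you have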
df = pd.DataFrame({'vals': [1, 2, 3, 4, 5], 'ids': [u'aball', u'bball', u'cnut', u'fball', 'ballxyz']})
ids vals
0 aball 1
1 bball 2
2 cnut 3
3 fball 4
4 ballxyz 5
and your plan is to filter all rows in which ids contains ball and to set ids as the new index. You can do
df.set_index('ids').filter(like='ball', axis=0)
which gives
vals
ids
aball 1
bball 2
fball 4
ballxyz 5
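To tie this back to str.contains, here is a short sketch that builds the result both ways on the same data. The variable names via_filter and via_contains are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4, 5],
                   'ids': ['aball', 'bball', 'cnut', 'fball', 'ballxyz']})

# Route 1: move ids into the index, then filter on the index labels
via_filter = df.set_index('ids').filter(like='ball', axis=0)

# Route 2: build a Boolean mask on the column first, then set the index
via_contains = df[df['ids'].str.contains('ball', na=False)].set_index('ids')

# Compare the two results
print(via_filter.equals(via_contains))
```

Which route reads better probably depends on whether ids is needed as the index afterwards.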
filter also accepts a regular expression, so you could keep only those rows whose index entry ends with ball. In this case you use
df.set_index('ids').filter(regex='ball$', axis=0)
vals
ids
aball 1
bball 2
fball 4
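Note that the entry ballxyz is no longer included, because it starts with ball rather than ending with it.
If you want to get all entries that start with ball, you can simply use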
df.set_index('ids').filter(regex='^ball', axis=0)
yielding
vals
ids
ballxyz 5
The same works with columns; the only thing you need to change is the axis=0 part. If you filter based on column labels, it would be axis=1.
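As a sketch of the column case, assume a hypothetical frame whose column labels (rather than its row labels) contain ball; the labels below are made up for illustration:

```python
import pandas as pd

# Hypothetical frame: the 'ball' strings are now column labels
df = pd.DataFrame({'aball': [1, 2], 'cnut': [3, 4], 'ballxyz': [5, 6]})

# axis=1 applies the same three filters to column labels instead of the index
contains_ball = df.filter(like='ball', axis=1)
ends_with_ball = df.filter(regex='ball$', axis=1)
starts_with_ball = df.filter(regex='^ball', axis=1)

print(list(contains_ball.columns))
print(list(ends_with_ball.columns))
print(list(starts_with_ball.columns))
```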