我有两个数据帧df1和df2,其中df2是df1的子集。我如何得到一个新的数据帧(df3),这是两个数据帧之间的差异?
换句话说,一个在df1中所有的行/列都不在df2中的数据帧?
我有两个数据帧df1和df2,其中df2是df1的子集。我如何得到一个新的数据帧(df3),这是两个数据帧之间的差异?
换句话说,一个在df1中所有的行/列都不在df2中的数据帧?
当前回答
定义数据框架:
df1 = pd.DataFrame({
'Name':
['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
'Age':
[23,45,12,34,27,44,28,39,40]
})
df2 = df1[df1.Name.isin(['John','Smith','Wale','Tom','Menda','Yuswa'])
df1
Name Age
0 John 23
1 Mike 45
2 Smith 12
3 Wale 34
4 Marry 27
5 Tom 44
6 Menda 28
7 Bolt 39
8 Yuswa 40
df2
Name Age
0 John 23
2 Smith 12
3 Wale 34
5 Tom 44
6 Menda 28
8 Yuswa 40
两者之间的区别是:
df1[~df1.isin(df2)].dropna()
Name Age
1 Mike 45.0
4 Marry 27.0
7 Bolt 39.0
地点:
isin(df2)返回df1中也在df2中的行。 ~(元素逻辑NOT)在表达式前面对结果求反,因此我们得到df1中不在df2中的元素——两者之间的差值。 .dropna()删除NaN显示所需输出的行
注意:这只适用于len(df1) >= len(df2)。如果df2比df1长,可以反转表达式:df2[~df2.isin(df1)].dropna()
其他回答
import pandas as pd
# given
df1 = pd.DataFrame({'Name':['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
'Age':[23,45,12,34,27,44,28,39,40]})
df2 = pd.DataFrame({'Name':['John','Smith','Wale','Tom','Menda','Yuswa',],
'Age':[23,12,34,44,28,40]})
# find elements in df1 that are not in df2
df_1notin2 = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))].reset_index(drop=True)
# output:
print('df1\n', df1)
print('df2\n', df2)
print('df_1notin2\n', df_1notin2)
# df1
# Age Name
# 0 23 John
# 1 45 Mike
# 2 12 Smith
# 3 34 Wale
# 4 27 Marry
# 5 44 Tom
# 6 28 Menda
# 7 39 Bolt
# 8 40 Yuswa
# df2
# Age Name
# 0 23 John
# 1 12 Smith
# 2 34 Wale
# 3 44 Tom
# 4 28 Menda
# 5 40 Yuswa
# df_1notin2
# Age Name
# 0 45 Mike
# 1 27 Marry
# 2 39 Bolt
nice @liangli的解决方案略有变化,不需要改变现有数据框架的索引:
newdf = df1.drop(df1.join(df2.set_index('Name').index))
使用lambda函数,您可以过滤_merge值为“left_only”的行,以获得df1中df2中缺失的所有行
df3 = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x :x['_merge']=='left_only']
df
另一个可能的解决方案是使用numpy广播:
df1[np.all(~np.all(df1.values == df2.values[:, None], axis=2), axis=0)]
输出:
Name Age
1 Mike 45
4 Marry 27
7 Bolt 39
我在处理副本时遇到了问题,当一边有副本,另一边至少有一个副本时,所以我使用了Counter。集合做一个更好的差异,确保双方有相同的计数。这不会返回副本,但如果双方有相同的计数,则不会返回任何副本。
from collections import Counter
def diff(df1, df2, on=None):
"""
:param on: same as pandas.df.merge(on) (a list of columns)
"""
on = on if on else df1.columns
df1on = df1[on]
df2on = df2[on]
c1 = Counter(df1on.apply(tuple, 'columns'))
c2 = Counter(df2on.apply(tuple, 'columns'))
c1c2 = c1-c2
c2c1 = c2-c1
df1ondf2on = pd.DataFrame(list(c1c2.elements()), columns=on)
df2ondf1on = pd.DataFrame(list(c2c1.elements()), columns=on)
df1df2 = df1.merge(df1ondf2on).drop_duplicates(subset=on)
df2df1 = df2.merge(df2ondf1on).drop_duplicates(subset=on)
return pd.concat([df1df2, df2df1])
> df1 = pd.DataFrame({'a': [1, 1, 3, 4, 4]})
> df2 = pd.DataFrame({'a': [1, 2, 3, 4, 4]})
> diff(df1, df2)
a
0 1
0 2