我试图突出显示两个数据帧之间发生了什么变化。
假设我有两个Python Pandas数据框架:
"StudentRoster Jan-1":
id Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.11 False Graduated
113 Zoe 4.12 True
"StudentRoster Jan-2":
id Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.21 False Graduated
113 Zoe 4.12 False On vacation
我的目标是输出一个HTML表,它:
标识已更改的行(可以是int, float, boolean,字符串)
输出具有相同的OLD和NEW值的行(理想情况下是HTML表),以便消费者可以清楚地看到两个数据框架之间发生了什么变化:
“StudentRoster差异Jan-1 - Jan-2”:
id名称分数isregistered评论
尼克是1.11|现在1.21假毕业
113佐伊4.12是真的|现在是假的|现在“度假”
我想我可以逐行逐列比较,但有没有更简单的方法?
import pandas as pd
import numpy as np
df = pd.read_excel('D:\\HARISH\\DATA SCIENCE\\1 MY Training\\SAMPLE DATA & projs\\CRICKET DATA\\IPL PLAYER LIST\\IPL PLAYER LIST _ harish.xlsx')
df1= srh = df[df['TEAM'].str.contains("SRH")]
df2 = csk = df[df['TEAM'].str.contains("CSK")]
srh = srh.iloc[:,0:2]
csk = csk.iloc[:,0:2]
csk = csk.reset_index(drop=True)
csk
srh = srh.reset_index(drop=True)
srh
new = pd.concat([srh, csk], axis=1)
new.head()
**
玩家类型
0 David Warner Batsman…多尼女士,机长
1 Bhuvaneshwar Kumar Bowler…拉文德拉·加德贾是全才
Manish Pandey Batsman…苏雷什·莱纳全能
拉希德·汗·阿尔曼·鲍勒…基达尔·贾达夫全能
4 Shikhar Dhawan Batsman ....多面手Dwayne Bravo
第一部分类似于Constantine,你可以得到哪个行是空的布尔值*:
In [21]: ne = (df1 != df2).any(1)
In [22]: ne
Out[22]:
0 False
1 True
2 True
dtype: bool
然后我们可以看到哪些条目发生了变化:
In [23]: ne_stacked = (df1 != df2).stack()
In [24]: changed = ne_stacked[ne_stacked]
In [25]: changed.index.names = ['id', 'col']
In [26]: changed
Out[26]:
id col
1 score True
2 isEnrolled True
Comment True
dtype: bool
这里的第一个条目是索引,第二个条目是已更改的列。
In [27]: difference_locations = np.where(df1 != df2)
In [28]: changed_from = df1.values[difference_locations]
In [29]: changed_to = df2.values[difference_locations]
In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
from to
id col
1 score 1.11 1.21
2 isEnrolled True False
Comment None On vacation
*注意:重要的是df1和df2在这里共享相同的索引。为了克服这种模糊性,可以使用df1确保只查看共享标签。Index & df2。索引,但我还是把它留作练习吧。
扩展@cge的答案,这对于结果的可读性来说非常酷:
a[a != b][np.any(a != b, axis=1)].join(pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
b[a != b][np.any(a != b, axis=1)]
,rsuffix='_b', how='outer'
).fillna('')
完整的演示示例:
import numpy as np, pandas as pd
a = pd.DataFrame(np.random.randn(7,3), columns=list('ABC'))
b = a.copy()
b.iloc[0,2] = np.nan
b.iloc[1,0] = 7
b.iloc[3,1] = 77
b.iloc[4,2] = 777
a[a != b][np.any(a != b, axis=1)].join(pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
b[a != b][np.any(a != b, axis=1)]
,rsuffix='_b', how='outer'
).fillna('')
结果:样本
在线演示
下面是另一种使用选择和合并的方法:
In [6]: # first lets create some dummy dataframes with some column(s) different
...: df1 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': range(20,25)})
...: df2 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': [20] + list(range(101,105))})
In [7]: df1
Out[7]:
a b c
0 -5 10 20
1 -4 11 21
2 -3 12 22
3 -2 13 23
4 -1 14 24
In [8]: df2
Out[8]:
a b c
0 -5 10 20
1 -4 11 101
2 -3 12 102
3 -2 13 103
4 -1 14 104
In [10]: # make condition over the columns you want to comapre
...: condition = df1['c'] != df2['c']
...:
...: # select rows from each dataframe where the condition holds
...: diff1 = df1[condition]
...: diff2 = df2[condition]
In [11]: # merge the selected rows (dataframes) with some suffixes (optional)
...: diff1.merge(diff2, on=['a','b'], suffixes=('_before', '_after'))
Out[11]:
a b c_before c_after
0 -4 11 21 101
1 -3 12 22 102
2 -2 13 23 103
3 -1 14 24 104
以下是来自Jupyter的截图:
import pandas as pd
import numpy as np
df = pd.read_excel('D:\\HARISH\\DATA SCIENCE\\1 MY Training\\SAMPLE DATA & projs\\CRICKET DATA\\IPL PLAYER LIST\\IPL PLAYER LIST _ harish.xlsx')
df1= srh = df[df['TEAM'].str.contains("SRH")]
df2 = csk = df[df['TEAM'].str.contains("CSK")]
srh = srh.iloc[:,0:2]
csk = csk.iloc[:,0:2]
csk = csk.reset_index(drop=True)
csk
srh = srh.reset_index(drop=True)
srh
new = pd.concat([srh, csk], axis=1)
new.head()
**
玩家类型
0 David Warner Batsman…多尼女士,机长
1 Bhuvaneshwar Kumar Bowler…拉文德拉·加德贾是全才
Manish Pandey Batsman…苏雷什·莱纳全能
拉希德·汗·阿尔曼·鲍勒…基达尔·贾达夫全能
4 Shikhar Dhawan Batsman ....多面手Dwayne Bravo