我试图突出显示两个数据帧之间发生了什么变化。

假设我有两个Python Pandas数据框架:

"StudentRoster Jan-1":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                Graduated
113  Zoe    4.12                     True       

"StudentRoster Jan-2":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                Graduated
113  Zoe    4.12                     False                On vacation

我的目标是输出一个HTML表,它:

标识已更改的行(可以是int, float, boolean,字符串) 输出具有相同的OLD和NEW值的行(理想情况下是HTML表),以便消费者可以清楚地看到两个数据框架之间发生了什么变化: “StudentRoster差异Jan-1 - Jan-2”: id名称分数isregistered评论 尼克是1.11|现在1.21假毕业 113佐伊4.12是真的|现在是假的|现在“度假”

我想我可以逐行逐列比较,但有没有更简单的方法?


当前回答

使用concat和drop_duplicate的不同方法:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)

df1 = pd.read_table(DF1, sep='\s+', index_col='id')
df2 = pd.read_table(DF2, sep='\s+', index_col='id')
#%%
dictionary = {1:df1,2:df2}
df=pd.concat(dictionary)
df.drop_duplicates(keep=False)

输出:

       Name  score isEnrolled      Comment
  id                                      
1 112  Nick   1.11      False    Graduated
  113   Zoe    NaN       True             
2 112  Nick   1.21      False    Graduated
  113   Zoe    NaN      False  On vacation

其他回答

import pandas as pd
import io

texts = ['''\
id   Name   score                    isEnrolled                        Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                           Graduated
113  Zoe    4.12                     True       ''',

         '''\
id   Name   score                    isEnrolled                        Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                           Graduated
113  Zoe    4.12                     False                         On vacation''']


df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,21,20])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,21,20])
df = pd.concat([df1,df2]) 

print(df)
#     id  Name  score isEnrolled               Comment
# 0  111  Jack   2.17       True  He was late to class
# 1  112  Nick   1.11      False             Graduated
# 2  113   Zoe   4.12       True                   NaN
# 0  111  Jack   2.17       True  He was late to class
# 1  112  Nick   1.21      False             Graduated
# 2  113   Zoe   4.12      False           On vacation

df.set_index(['id', 'Name'], inplace=True)
print(df)
#           score isEnrolled               Comment
# id  Name                                        
# 111 Jack   2.17       True  He was late to class
# 112 Nick   1.11      False             Graduated
# 113 Zoe    4.12       True                   NaN
# 111 Jack   2.17       True  He was late to class
# 112 Nick   1.21      False             Graduated
# 113 Zoe    4.12      False           On vacation

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

changes = df.groupby(level=['id', 'Name']).agg(report_diff)
print(changes)

打印

                score    isEnrolled               Comment
id  Name                                                 
111 Jack         2.17          True  He was late to class
112 Nick  1.11 | 1.21         False             Graduated
113 Zoe          4.12  True | False     nan | On vacation

在摆弄了@journois的答案后,由于Panel的弃用,我能够使用MultiIndex而不是Panel来工作。

首先,创建一些虚拟数据:

df1 = pd.DataFrame({
    'id': ['111', '222', '333', '444', '555'],
    'let': ['a', 'b', 'c', 'd', 'e'],
    'num': ['1', '2', '3', '4', '5']
})
df2 = pd.DataFrame({
    'id': ['111', '222', '333', '444', '666'],
    'let': ['a', 'b', 'c', 'D', 'f'],
    'num': ['1', '2', 'Three', '4', '6'],
})

然后,定义你的diff函数,在这种情况下,我将使用他的答案report_diff保持不变:

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

然后,我将数据连接到一个MultiIndex数据框架:

df_all = pd.concat(
    [df1.set_index('id'), df2.set_index('id')], 
    axis='columns', 
    keys=['df1', 'df2'],
    join='outer'
)
df_all = df_all.swaplevel(axis='columns')[df1.columns[1:]]

最后,我将对每个列组应用report_diff:

df_final.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))

这个输出:

         let        num
111        a          1
222        b          2
333        c  3 | Three
444    d | D          4
555  e | nan    5 | nan
666  nan | f    nan | 6

这就是全部!

扩展@cge的答案,这对于结果的可读性来说非常酷:

a[a != b][np.any(a != b, axis=1)].join(pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
        b[a != b][np.any(a != b, axis=1)]
        ,rsuffix='_b', how='outer'
).fillna('')

完整的演示示例:

import numpy as np, pandas as pd

a = pd.DataFrame(np.random.randn(7,3), columns=list('ABC'))
b = a.copy()
b.iloc[0,2] = np.nan
b.iloc[1,0] = 7
b.iloc[3,1] = 77
b.iloc[4,2] = 777

a[a != b][np.any(a != b, axis=1)].join(pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
        b[a != b][np.any(a != b, axis=1)]
        ,rsuffix='_b', how='outer'
).fillna('')

结果:样本

在线演示

第一部分类似于Constantine,你可以得到哪个行是空的布尔值*:

In [21]: ne = (df1 != df2).any(1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool

然后我们可以看到哪些条目发生了变化:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

这里的第一个条目是索引,第二个条目是已更改的列。

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation

*注意:重要的是df1和df2在这里共享相同的索引。为了克服这种模糊性,可以使用df1确保只查看共享标签。Index & df2。索引,但我还是把它留作练习吧。

突出显示两个数据框架之间的差异

可以使用DataFrame样式属性来突出显示有差异的单元格的背景颜色。

使用原始问题中的示例数据

第一步是用concat函数水平连接dataframe,并用keys参数区分每一帧:

df_all = pd.concat([df.set_index('id'), df2.set_index('id')], 
                   axis='columns', keys=['First', 'Second'])
df_all

交换列级别并将相同的列名放在彼此旁边可能更容易:

df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]
df_final

现在,很容易看出不同的框架。但是,我们可以进一步使用style属性来突出显示不同的单元格。我们定义了一个自定义函数来实现这一点,您可以在本部分文档中看到。

def highlight_diff(data, color='yellow'):
    attr = 'background-color: {}'.format(color)
    other = data.xs('First', axis='columns', level=-1)
    return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
                        index=data.index, columns=data.columns)

df_final.style.apply(highlight_diff, axis=None)

这将突出显示两者都有缺失值的单元格。您可以填充它们或提供额外的逻辑,这样它们就不会被突出显示。