我试图突出显示两个数据帧之间发生了什么变化。

假设我有两个Python Pandas数据框架:

"StudentRoster Jan-1":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                Graduated
113  Zoe    4.12                     True       

"StudentRoster Jan-2":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                Graduated
113  Zoe    4.12                     False                On vacation

我的目标是输出一个HTML表,它:

标识已更改的行(可以是int, float, boolean,字符串) 输出具有相同的OLD和NEW值的行(理想情况下是HTML表),以便消费者可以清楚地看到两个数据框架之间发生了什么变化: “StudentRoster差异Jan-1 - Jan-2”: id名称分数isregistered评论 尼克是1.11|现在1.21假毕业 113佐伊4.12是真的|现在是假的|现在“度假”

我想我可以逐行逐列比较,但有没有更简单的方法?


当前回答

扩展@cge的答案,这对于结果的可读性来说非常酷:

a[a != b][np.any(a != b, axis=1)].join(pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
        b[a != b][np.any(a != b, axis=1)]
        ,rsuffix='_b', how='outer'
).fillna('')

完整的演示示例:

import numpy as np, pandas as pd

a = pd.DataFrame(np.random.randn(7,3), columns=list('ABC'))
b = a.copy()
b.iloc[0,2] = np.nan
b.iloc[1,0] = 7
b.iloc[3,1] = 77
b.iloc[4,2] = 777

a[a != b][np.any(a != b, axis=1)].join(pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
        b[a != b][np.any(a != b, axis=1)]
        ,rsuffix='_b', how='outer'
).fillna('')

结果:样本

在线演示

其他回答

我遇到过这个问题,但在找到这篇文章之前找到了答案:

根据unutbu的回答,加载你的数据…

import pandas as pd
import io

texts = ['''\
id   Name   score                    isEnrolled                       Date
111  Jack                            True              2013-05-01 12:00:00
112  Nick   1.11                     False             2013-05-12 15:05:23
     Zoe    4.12                     True                                  ''',

         '''\
id   Name   score                    isEnrolled                       Date
111  Jack   2.17                     True              2013-05-01 12:00:00
112  Nick   1.21                     False                                
     Zoe    4.12                     False             2013-05-01 12:00:00''']


df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,17,20], parse_dates=[4])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,17,20], parse_dates=[4])

定义你的diff函数…

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

然后你可以简单地使用Panel来总结:

my_panel = pd.Panel(dict(df1=df1,df2=df2))
print my_panel.apply(report_diff, axis=0)

#          id  Name        score    isEnrolled                       Date
#0        111  Jack   nan | 2.17          True        2013-05-01 12:00:00
#1        112  Nick  1.11 | 1.21         False  2013-05-12 15:05:23 | NaT
#2  nan | nan   Zoe         4.12  True | False  NaT | 2013-05-01 12:00:00

顺便说一下,如果你在IPython Notebook中,你可能喜欢使用彩色差异函数 根据单元格是否不同、相等或左/右为空来给出颜色:

from IPython.display import HTML
pd.options.display.max_colwidth = 500  # You need this, otherwise pandas
#                          will limit your HTML strings to 50 characters

def report_diff(x):
    if x[0]==x[1]:
        return unicode(x[0].__str__())
    elif pd.isnull(x[0]) and pd.isnull(x[1]):
        return u'<table style="background-color:#00ff00;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', 'nan')
    elif pd.isnull(x[0]) and ~pd.isnull(x[1]):
        return u'<table style="background-color:#ffff00;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', x[1])
    elif ~pd.isnull(x[0]) and pd.isnull(x[1]):
        return u'<table style="background-color:#0000ff;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0],'nan')
    else:
        return u'<table style="background-color:#ff0000;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0], x[1])

HTML(my_panel.apply(report_diff, axis=0).to_html(escape=False))
import pandas as pd
import io

texts = ['''\
id   Name   score                    isEnrolled                        Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                           Graduated
113  Zoe    4.12                     True       ''',

         '''\
id   Name   score                    isEnrolled                        Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                           Graduated
113  Zoe    4.12                     False                         On vacation''']


df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,21,20])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,21,20])
df = pd.concat([df1,df2]) 

print(df)
#     id  Name  score isEnrolled               Comment
# 0  111  Jack   2.17       True  He was late to class
# 1  112  Nick   1.11      False             Graduated
# 2  113   Zoe   4.12       True                   NaN
# 0  111  Jack   2.17       True  He was late to class
# 1  112  Nick   1.21      False             Graduated
# 2  113   Zoe   4.12      False           On vacation

df.set_index(['id', 'Name'], inplace=True)
print(df)
#           score isEnrolled               Comment
# id  Name                                        
# 111 Jack   2.17       True  He was late to class
# 112 Nick   1.11      False             Graduated
# 113 Zoe    4.12       True                   NaN
# 111 Jack   2.17       True  He was late to class
# 112 Nick   1.21      False             Graduated
# 113 Zoe    4.12      False           On vacation

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

changes = df.groupby(level=['id', 'Name']).agg(report_diff)
print(changes)

打印

                score    isEnrolled               Comment
id  Name                                                 
111 Jack         2.17          True  He was late to class
112 Nick  1.11 | 1.21         False             Graduated
113 Zoe          4.12  True | False     nan | On vacation

使用concat和drop_duplicate的不同方法:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)

df1 = pd.read_table(DF1, sep='\s+', index_col='id')
df2 = pd.read_table(DF2, sep='\s+', index_col='id')
#%%
dictionary = {1:df1,2:df2}
df=pd.concat(dictionary)
df.drop_duplicates(keep=False)

输出:

       Name  score isEnrolled      Comment
  id                                      
1 112  Nick   1.11      False    Graduated
  113   Zoe    NaN       True             
2 112  Nick   1.21      False    Graduated
  113   Zoe    NaN      False  On vacation

下面是另一种使用选择和合并的方法:

In [6]: # first lets create some dummy dataframes with some column(s) different
   ...: df1 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': range(20,25)})
   ...: df2 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': [20] + list(range(101,105))})


In [7]: df1
Out[7]:
   a   b   c
0 -5  10  20
1 -4  11  21
2 -3  12  22
3 -2  13  23
4 -1  14  24


In [8]: df2
Out[8]:
   a   b    c
0 -5  10   20
1 -4  11  101
2 -3  12  102
3 -2  13  103
4 -1  14  104


In [10]: # make condition over the columns you want to comapre
    ...: condition = df1['c'] != df2['c']
    ...:
    ...: # select rows from each dataframe where the condition holds
    ...: diff1 = df1[condition]
    ...: diff2 = df2[condition]


In [11]: # merge the selected rows (dataframes) with some suffixes (optional)
    ...: diff1.merge(diff2, on=['a','b'], suffixes=('_before', '_after'))
Out[11]:
   a   b  c_before  c_after
0 -4  11        21      101
1 -3  12        22      102
2 -2  13        23      103
3 -1  14        24      104

以下是来自Jupyter的截图:

第一部分类似于Constantine,你可以得到哪个行是空的布尔值*:

In [21]: ne = (df1 != df2).any(1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool

然后我们可以看到哪些条目发生了变化:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

这里的第一个条目是索引,第二个条目是已更改的列。

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation

*注意:重要的是df1和df2在这里共享相同的索引。为了克服这种模糊性,可以使用df1确保只查看共享标签。Index & df2。索引,但我还是把它留作练习吧。