我试图突出显示两个数据帧之间发生了什么变化。

假设我有两个Python Pandas数据框架:

"StudentRoster Jan-1":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                Graduated
113  Zoe    4.12                     True       

"StudentRoster Jan-2":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                Graduated
113  Zoe    4.12                     False                On vacation

我的目标是输出一个HTML表,它:

标识已更改的行(可以是int, float, boolean,字符串) 输出具有相同的OLD和NEW值的行(理想情况下是HTML表),以便消费者可以清楚地看到两个数据框架之间发生了什么变化: “StudentRoster差异Jan-1 - Jan-2”: id名称分数isregistered评论 尼克是1.11|现在1.21假毕业 113佐伊4.12是真的|现在是假的|现在“度假”

我想我可以逐行逐列比较,但有没有更简单的方法?


当前回答

import pandas as pd
import io

texts = ['''\
id   Name   score                    isEnrolled                        Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                           Graduated
113  Zoe    4.12                     True       ''',

         '''\
id   Name   score                    isEnrolled                        Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                           Graduated
113  Zoe    4.12                     False                         On vacation''']


df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,21,20])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,21,20])
df = pd.concat([df1,df2]) 

print(df)
#     id  Name  score isEnrolled               Comment
# 0  111  Jack   2.17       True  He was late to class
# 1  112  Nick   1.11      False             Graduated
# 2  113   Zoe   4.12       True                   NaN
# 0  111  Jack   2.17       True  He was late to class
# 1  112  Nick   1.21      False             Graduated
# 2  113   Zoe   4.12      False           On vacation

df.set_index(['id', 'Name'], inplace=True)
print(df)
#           score isEnrolled               Comment
# id  Name                                        
# 111 Jack   2.17       True  He was late to class
# 112 Nick   1.11      False             Graduated
# 113 Zoe    4.12       True                   NaN
# 111 Jack   2.17       True  He was late to class
# 112 Nick   1.21      False             Graduated
# 113 Zoe    4.12      False           On vacation

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

changes = df.groupby(level=['id', 'Name']).agg(report_diff)
print(changes)

打印

                score    isEnrolled               Comment
id  Name                                                 
111 Jack         2.17          True  He was late to class
112 Nick  1.11 | 1.21         False             Graduated
113 Zoe          4.12  True | False     nan | On vacation

其他回答

这个答案只是扩展了@Andy Hayden的答案,使其能够适应数值字段为nan的情况,并将其包装成一个函数。

import pandas as pd
import numpy as np


def diff_pd(df1, df2):
    """Identify differences between two pandas DataFrames"""
    assert (df1.columns == df2.columns).all(), \
        "DataFrame column names are different"
    if any(df1.dtypes != df2.dtypes):
        "Data Types are different, trying to convert"
        df2 = df2.astype(df1.dtypes)
    if df1.equals(df2):
        return None
    else:
        # need to account for np.nan != np.nan returning True
        diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
        ne_stacked = diff_mask.stack()
        changed = ne_stacked[ne_stacked]
        changed.index.names = ['id', 'col']
        difference_locations = np.where(diff_mask)
        changed_from = df1.values[difference_locations]
        changed_to = df2.values[difference_locations]
        return pd.DataFrame({'from': changed_from, 'to': changed_to},
                            index=changed.index)

所以对于你的数据(稍微编辑一下,在分数列中有一个NaN):

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)
df1 = pd.read_table(DF1, sep='\s+', index_col='id')
df2 = pd.read_table(DF2, sep='\s+', index_col='id')
diff_pd(df1, df2)

输出:

                from           to
id  col                          
112 score       1.11         1.21
113 isEnrolled  True        False
    Comment           On vacation

在两个数据帧之间寻找不对称差异的函数实现如下: (基于熊猫的集差) 要点:https://gist.github.com/oneryalcin/68cf25f536a25e65f0b3c84f9c118e03

def diff_df(df1, df2, how="left"):
    """
      Find Difference of rows for given two dataframes
      this function is not symmetric, means
            diff(x, y) != diff(y, x)
      however
            diff(x, y, how='left') == diff(y, x, how='right')

      Ref: https://stackoverflow.com/questions/18180763/set-difference-for-pandas/40209800#40209800
    """
    if (df1.columns != df2.columns).any():
        raise ValueError("Two dataframe columns must match")

    if df1.equals(df2):
        return None
    elif how == 'right':
        return pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
    elif how == 'left':
        return pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
    else:
        raise ValueError('how parameter supports only "left" or "right keywords"')

例子:

df1 = pd.DataFrame(d1)
Out[1]: 
                Comment  Name  isEnrolled  score
0  He was late to class  Jack        True   2.17
1             Graduated  Nick       False   1.11
2                         Zoe        True   4.12


df2 = pd.DataFrame(d2)

Out[2]: 
                Comment  Name  isEnrolled  score
0  He was late to class  Jack        True   2.17
1           On vacation   Zoe        True   4.12

diff_df(df1, df2)
Out[3]: 
     Comment  Name  isEnrolled  score
1  Graduated  Nick       False   1.11
2              Zoe        True   4.12

diff_df(df2, df1)
Out[4]: 
       Comment Name  isEnrolled  score
1  On vacation  Zoe        True   4.12

# This gives the same result as above
diff_df(df1, df2, how='right')
Out[22]: 
       Comment Name  isEnrolled  score
1  On vacation  Zoe        True   4.12

如果两个数据帧中有相同的id,那么找出发生了什么变化实际上是相当容易的。只要执行frame1 != frame2,就会得到一个布尔型的DataFrame,其中每个True都是已更改的数据。由此,您可以通过执行changedids = frame1.index[np。Any (frame1 != frame2,axis=1)]。

使用concat和drop_duplicate的不同方法:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)

df1 = pd.read_table(DF1, sep='\s+', index_col='id')
df2 = pd.read_table(DF2, sep='\s+', index_col='id')
#%%
dictionary = {1:df1,2:df2}
df=pd.concat(dictionary)
df.drop_duplicates(keep=False)

输出:

       Name  score isEnrolled      Comment
  id                                      
1 112  Nick   1.11      False    Graduated
  113   Zoe    NaN       True             
2 112  Nick   1.21      False    Graduated
  113   Zoe    NaN      False  On vacation

pandas >= 1.1: DataFrame.compare

使用pandas 1.1,基本上可以用一个函数调用复制Ted Petrou的输出。例子摘自文档:

pd.__version__
# '1.1.0'

df1.compare(df2)

  score       isEnrolled       Comment             
   self other       self other    self        other
1  1.11  1.21        NaN   NaN     NaN          NaN
2   NaN   NaN        1.0   0.0     NaN  On vacation

这里,“self”指的是LHS数据帧,而“other”指的是RHS数据帧。默认情况下,相等的值将被nan替换,因此您可以只关注差异。如果您想显示相同的值,请使用

df1.compare(df2, keep_equal=True, keep_shape=True) 

  score       isEnrolled           Comment             
   self other       self  other       self        other
1  1.11  1.21      False  False  Graduated    Graduated
2  4.12  4.12       True  False        NaN  On vacation

你也可以使用align_axis改变比较轴:

df1.compare(df2, align_axis='index')

         score  isEnrolled      Comment
1 self    1.11         NaN          NaN
  other   1.21         NaN          NaN
2 self     NaN         1.0          NaN
  other    NaN         0.0  On vacation

这是逐行比较值,而不是逐列比较值。