I am trying to highlight exactly what changed between two dataframes.

Suppose I have two Python Pandas dataframes:

"StudentRoster Jan-1":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                Graduated
113  Zoe    4.12                     True       

"StudentRoster Jan-2":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                Graduated
113  Zoe    4.12                     False                On vacation

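My goal is to output an HTML table that:

1. Identifies rows that have changed (could be int, float, boolean, string)
2. Outputs rows with the same, OLD and NEW values (ideally into an HTML table) so the consumer can clearly see what changed between the two dataframes:

"StudentRoster Difference Jan-1 - Jan-2":
id   Name   score                    isEnrolled           Comment
112  Nick   was 1.11 | now 1.21      False                Graduated
113  Zoe    4.12                     was True | now False was "" | now "On vacation"

I suppose I could do a row-by-row and column-by-column comparison, but is there an easier way?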


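Current answer

Highlighting the difference between two DataFrames

You can use the DataFrame style property to highlight the background color of the cells that differ.

Using the example data from the original question

The first step is to concatenate the DataFrames horizontally with the concat function and distinguish each frame with the keys parameter: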

df_all = pd.concat([df.set_index('id'), df2.set_index('id')], 
                   axis='columns', keys=['First', 'Second'])
df_all

It is probably easier to swap the column levels and put the same column names next to each other:

df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]
df_final

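Now it is much easier to spot the differences between the frames. But we can go further and use the style property to highlight the cells that are different. We define a custom function to do this, following the pattern described in the styling section of the pandas documentation.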

def highlight_diff(data, color='yellow'):
    attr = 'background-color: {}'.format(color)
    other = data.xs('First', axis='columns', level=-1)
    return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
                        index=data.index, columns=data.columns)

df_final.style.apply(highlight_diff, axis=None)

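This will also highlight cells where both frames have missing values, because NaN never compares equal to NaN. You can either fill them or provide extra logic so that they don't get highlighted.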
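As a minimal sketch of that "extra logic", the variant below leaves a cell unstyled when the value is missing in both frames. The function name `highlight_diff_skip_nan` is hypothetical, and the sample data is the question's roster with Zoe's score changed to NaN in both frames (an assumption, made only so the both-missing case occurs).

```python
import numpy as np
import pandas as pd

# Question data, with Zoe's score set to NaN in both frames (assumption,
# so that the both-missing case actually shows up).
df = pd.DataFrame({
    'id': [111, 112, 113],
    'Name': ['Jack', 'Nick', 'Zoe'],
    'score': [2.17, 1.11, np.nan],
    'isEnrolled': [True, False, True],
    'Comment': ['He was late to class', 'Graduated', np.nan],
})
df2 = pd.DataFrame({
    'id': [111, 112, 113],
    'Name': ['Jack', 'Nick', 'Zoe'],
    'score': [2.17, 1.21, np.nan],
    'isEnrolled': [True, False, False],
    'Comment': ['He was late to class', 'Graduated', 'On vacation'],
})

df_all = pd.concat([df.set_index('id'), df2.set_index('id')],
                   axis='columns', keys=['First', 'Second'])
df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]


def highlight_diff_skip_nan(data, color='yellow'):
    """Style the 'Second' cells that differ from 'First', ignoring NaN/NaN pairs."""
    attr = 'background-color: {}'.format(color)
    first = data.xs('First', axis='columns', level=-1)
    second = data.xs('Second', axis='columns', level=-1)
    # A cell counts as changed if the values differ and are not both missing.
    changed = first.ne(second) & ~(first.isna() & second.isna())
    # Start with no styling anywhere, then mark the changed 'Second' cells.
    styles = pd.DataFrame('', index=data.index, columns=data.columns)
    for name in changed.columns:
        styles.loc[changed[name], (name, 'Second')] = attr
    return styles


styles = highlight_diff_skip_nan(df_final)
# In a notebook (Styler needs jinja2 installed):
# df_final.style.apply(highlight_diff_skip_nan, axis=None)
```

With this data the original `highlight_diff` would colour both score cells for id 113, whereas the sketch leaves them blank and still marks the changed score, isEnrolled and Comment cells.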

Other answers

Here is another way, using selection and merge:

In [6]: # first let's create some dummy dataframes with some column(s) different
   ...: df1 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': range(20,25)})
   ...: df2 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': [20] + list(range(101,105))})


In [7]: df1
Out[7]:
   a   b   c
0 -5  10  20
1 -4  11  21
2 -3  12  22
3 -2  13  23
4 -1  14  24


In [8]: df2
Out[8]:
   a   b    c
0 -5  10   20
1 -4  11  101
2 -3  12  102
3 -2  13  103
4 -1  14  104


In [10]: # make condition over the columns you want to compare
    ...: condition = df1['c'] != df2['c']
    ...:
    ...: # select rows from each dataframe where the condition holds
    ...: diff1 = df1[condition]
    ...: diff2 = df2[condition]


In [11]: # merge the selected rows (dataframes) with some suffixes (optional)
    ...: diff1.merge(diff2, on=['a','b'], suffixes=('_before', '_after'))
Out[11]:
   a   b  c_before  c_after
0 -4  11        21      101
1 -3  12        22      102
2 -2  13        23      103
3 -1  14        24      104
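If you do not know in advance which column changed, the same idea can be applied to all columns at once. The following is only a sketch under the assumption that both frames share the same index and columns; the names `changed_rows` and `side_by_side` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'a': range(-5, 0), 'b': range(10, 15), 'c': range(20, 25)})
df2 = pd.DataFrame({'a': range(-5, 0), 'b': range(10, 15),
                    'c': [20] + list(range(101, 105))})

# True for every row where at least one column differs
# (assumes both frames have identical index and column labels).
changed_rows = (df1 != df2).any(axis=1)

# Put the old and new versions of the changed rows side by side;
# the suffixes tell the two copies of each column apart.
side_by_side = df1[changed_rows].join(df2[changed_rows],
                                      lsuffix='_before', rsuffix='_after')
print(side_by_side)
```

Note that `!=` treats NaN as different from NaN, so rows with a missing value in the same place in both frames would also be selected.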


This answer simply extends @Andy Hayden's answer, making it resilient to numeric fields that are NaN, and wraps it up into a function.

import pandas as pd
import numpy as np


def diff_pd(df1, df2):
    """Identify differences between two pandas DataFrames"""
    assert (df1.columns == df2.columns).all(), \
        "DataFrame column names are different"
    if any(df1.dtypes != df2.dtypes):
        # report the mismatch, then coerce df2 to df1's dtypes
        print("Data Types are different, trying to convert")
        df2 = df2.astype(df1.dtypes)
    if df1.equals(df2):
        return None
    else:
        # need to account for np.nan != np.nan returning True
        diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
        ne_stacked = diff_mask.stack()
        changed = ne_stacked[ne_stacked]
        changed.index.names = ['id', 'col']
        difference_locations = np.where(diff_mask)
        changed_from = df1.values[difference_locations]
        changed_to = df2.values[difference_locations]
        return pd.DataFrame({'from': changed_from, 'to': changed_to},
                            index=changed.index)

So with your data (slightly edited to have a NaN in the score column):

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)
# raw strings avoid the invalid-escape warning for \s on recent Python versions
df1 = pd.read_table(DF1, sep=r'\s+', index_col='id')
df2 = pd.read_table(DF2, sep=r'\s+', index_col='id')
diff_pd(df1, df2)

Output:

                from           to
id  col                          
112 score       1.11         1.21
113 isEnrolled  True        False
    Comment           On vacation
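Since the question asks for an HTML table, the from/to frame can then be rendered with `DataFrame.to_html`. The sketch below does not call `diff_pd`; it uses a hand-built stand-in with the same shape as the output above, so the values are copied from that output rather than computed.

```python
import pandas as pd

# Hand-built stand-in for the frame returned by diff_pd(df1, df2) above.
index = pd.MultiIndex.from_tuples(
    [(112, 'score'), (113, 'isEnrolled'), (113, 'Comment')],
    names=['id', 'col'])
diff = pd.DataFrame({'from': [1.11, True, ' '],
                     'to': [1.21, False, 'On vacation']},
                    index=index)

# Render as an HTML <table>; na_rep controls how missing values are shown.
html = diff.to_html(na_rep='')
print(html)
```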

The function below finds the asymmetric difference between two dataframes (based on the "set difference for pandas" approach). Gist: https://gist.github.com/oneryalcin/68cf25f536a25e65f0b3c84f9c118e03

def diff_df(df1, df2, how="left"):
    """
      Find Difference of rows for given two dataframes
      this function is not symmetric, means
            diff(x, y) != diff(y, x)
      however
            diff(x, y, how='left') == diff(y, x, how='right')

      Ref: https://stackoverflow.com/questions/18180763/set-difference-for-pandas/40209800#40209800
    """
    if (df1.columns != df2.columns).any():
        raise ValueError("Two dataframe columns must match")

    if df1.equals(df2):
        return None
    elif how == 'right':
        return pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
    elif how == 'left':
        return pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
    else:
        raise ValueError('how parameter supports only "left" or "right" keywords')

Example:

df1 = pd.DataFrame(d1)
Out[1]: 
                Comment  Name  isEnrolled  score
0  He was late to class  Jack        True   2.17
1             Graduated  Nick       False   1.11
2                         Zoe        True   4.12


df2 = pd.DataFrame(d2)

Out[2]: 
                Comment  Name  isEnrolled  score
0  He was late to class  Jack        True   2.17
1           On vacation   Zoe        True   4.12

diff_df(df1, df2)
Out[3]: 
     Comment  Name  isEnrolled  score
1  Graduated  Nick       False   1.11
2              Zoe        True   4.12

diff_df(df2, df1)
Out[4]: 
       Comment Name  isEnrolled  score
1  On vacation  Zoe        True   4.12

# This gives the same result as above
diff_df(df1, df2, how='right')
Out[22]: 
       Comment Name  isEnrolled  score
1  On vacation  Zoe        True   4.12
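For comparison, here is a hedged alternative sketch of the same left/right difference using `merge` with `indicator=True` instead of `drop_duplicates`. It is not part of `diff_df`; the sample frames are a reduced version of the example above (the isEnrolled column is left out), and the names `only_in_df1` and `only_in_df2` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Jack', 'Nick', 'Zoe'],
    'score': [2.17, 1.11, 4.12],
    'Comment': ['He was late to class', 'Graduated', ''],
})
df2 = pd.DataFrame({
    'Name': ['Jack', 'Zoe'],
    'score': [2.17, 4.12],
    'Comment': ['He was late to class', 'On vacation'],
})

# An outer merge on all shared columns; the _merge column records whether
# each row came from the left frame only, the right frame only, or both.
merged = df1.merge(df2, how='outer', indicator=True)

only_in_df1 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
only_in_df2 = merged[merged['_merge'] == 'right_only'].drop(columns='_merge')
print(only_in_df1)
print(only_in_df2)
```

One difference in behaviour: with `drop_duplicates(keep=False)`, a row that appears twice within df1 itself is dropped even if it is absent from df2, while the merge-based version keeps it.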

import pandas as pd
import numpy as np

df = pd.read_excel('D:\\HARISH\\DATA SCIENCE\\1 MY Training\\SAMPLE DATA & projs\\CRICKET DATA\\IPL PLAYER LIST\\IPL PLAYER LIST _ harish.xlsx')


df1 = srh = df[df['TEAM'].str.contains("SRH")]
df2 = csk = df[df['TEAM'].str.contains("CSK")]

srh = srh.iloc[:,0:2]
csk = csk.iloc[:,0:2]

csk = csk.reset_index(drop=True)
csk

srh = srh.reset_index(drop=True)
srh

new = pd.concat([srh, csk], axis=1)

new.head()

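   Player              Type
0  David Warner        Batsman  ...  MS Dhoni         Captain
1  Bhuvaneshwar Kumar  Bowler   ...  Ravindra Jadeja  All-Rounder
2  Manish Pandey       Batsman  ...  Suresh Raina     All-Rounder
3  Rashid Khan Arman   Bowler   ...  Kedar Jadhav     All-Rounder
4  Shikhar Dhawan      Batsman  ...  Dwayne Bravo     All-Rounder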