I am trying to highlight exactly what changed between two dataframes.

Suppose I have two Python Pandas dataframes:

"StudentRoster Jan-1":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                Graduated
113  Zoe    4.12                     True       

"StudentRoster Jan-2":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                Graduated
113  Zoe    4.12                     False                On vacation

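My goal is to output an HTML table that:

1. Identifies rows that have changed (the values could be int, float, boolean, or string)
2. Outputs those rows with their unchanged, OLD, and NEW values (ideally as an HTML table), so the consumer can clearly see what changed between the two dataframes:

"StudentRoster Difference Jan-1 - Jan-2":
id   Name   score                    isEnrolled           Comment
112  Nick   was 1.11 | now 1.21      False                Graduated
113  Zoe    4.12                     was True | now False was "" | now "On vacation"

I suppose I could do a row-by-row, column-by-column comparison, but is there an easier way?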


Current answer

If you found this thread while trying to compare DataFrames in tests, take a look at the assert_frame_equal method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.testing.assert_frame_equal.html
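
As a rough illustration of how that method behaves in a test, here is a minimal sketch. The frames `jan1` and `jan2` and the flag `frames_match` are made-up names for this example, not part of the question's code.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical two-row roster, loosely based on the question's data.
jan1 = pd.DataFrame(
    {"Name": ["Jack", "Nick"], "score": [2.17, 1.11]},
    index=pd.Index([111, 112], name="id"),
)
jan2 = jan1.copy()

# Identical frames: the call returns None and raises nothing.
assert_frame_equal(jan1, jan2)

# Change one cell; the same call now raises AssertionError,
# and the error message describes the mismatch.
jan2.loc[112, "score"] = 1.21
try:
    assert_frame_equal(jan1, jan2)
    frames_match = True
except AssertionError as err:
    frames_match = False
    print(err)
```

In a real test you would normally let the AssertionError propagate so the test fails; the try/except here is only to show both outcomes in one script.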

Other answers

Here is a function that finds the asymmetric difference between two DataFrames (based on the set difference for pandas). Gist: https://gist.github.com/oneryalcin/68cf25f536a25e65f0b3c84f9c118e03

import pandas as pd


def diff_df(df1, df2, how="left"):
    """
      Find Difference of rows for given two dataframes
      this function is not symmetric, means
            diff(x, y) != diff(y, x)
      however
            diff(x, y, how='left') == diff(y, x, how='right')

      Ref: https://stackoverflow.com/questions/18180763/set-difference-for-pandas/40209800#40209800
    """
    # Index.equals also handles column lists of different lengths
    if not df1.columns.equals(df2.columns):
        raise ValueError("Two dataframe columns must match")

    if df1.equals(df2):
        return None
    elif how == 'right':
        # df1 is added twice, so every df1 row is a duplicate and is dropped
        return pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
    elif how == 'left':
        # df2 is added twice, so every df2 row is a duplicate and is dropped
        return pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
    else:
        raise ValueError('how parameter supports only "left" or "right" keywords')

Example:

df1 = pd.DataFrame(d1)
Out[1]: 
                Comment  Name  isEnrolled  score
0  He was late to class  Jack        True   2.17
1             Graduated  Nick       False   1.11
2                         Zoe        True   4.12


df2 = pd.DataFrame(d2)

Out[2]: 
                Comment  Name  isEnrolled  score
0  He was late to class  Jack        True   2.17
1           On vacation   Zoe        True   4.12

diff_df(df1, df2)
Out[3]: 
     Comment  Name  isEnrolled  score
1  Graduated  Nick       False   1.11
2              Zoe        True   4.12

diff_df(df2, df1)
Out[4]: 
       Comment Name  isEnrolled  score
1  On vacation  Zoe        True   4.12

# This gives the same result as above
diff_df(df1, df2, how='right')
Out[22]: 
       Comment Name  isEnrolled  score
1  On vacation  Zoe        True   4.12

This answer simply extends @Andy Hayden's answer, making it robust to numeric fields that are NaN, and wraps it in a function.

import pandas as pd
import numpy as np


def diff_pd(df1, df2):
    """Identify differences between two pandas DataFrames"""
    assert (df1.columns == df2.columns).all(), \
        "DataFrame column names are different"
    if any(df1.dtypes != df2.dtypes):
        # report the mismatch, then cast df2 to df1's column dtypes
        print("Data Types are different, trying to convert")
        df2 = df2.astype(df1.dtypes)
    if df1.equals(df2):
        return None
    else:
        # need to account for np.nan != np.nan returning True
        diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
        ne_stacked = diff_mask.stack()
        changed = ne_stacked[ne_stacked]
        changed.index.names = ['id', 'col']
        difference_locations = np.where(diff_mask)
        changed_from = df1.values[difference_locations]
        changed_to = df2.values[difference_locations]
        return pd.DataFrame({'from': changed_from, 'to': changed_to},
                            index=changed.index)

So with your data (slightly edited to have a NaN in the score column):

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)
df1 = pd.read_table(DF1, sep=r'\s+', index_col='id')
df2 = pd.read_table(DF2, sep=r'\s+', index_col='id')
diff_pd(df1, df2)

Output:

                from           to
id  col                          
112 score       1.11         1.21
113 isEnrolled  True        False
    Comment           On vacation

import pandas as pd
import numpy as np

df = pd.read_excel('D:\\HARISH\\DATA SCIENCE\\1 MY Training\\SAMPLE DATA & projs\\CRICKET DATA\\IPL PLAYER LIST\\IPL PLAYER LIST _ harish.xlsx')


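# Rows for each team (SRH and CSK)
srh = df[df['TEAM'].str.contains("SRH")]
csk = df[df['TEAM'].str.contains("CSK")]

# Keep only the first two columns
srh = srh.iloc[:, 0:2]
csk = csk.iloc[:, 0:2]

# Reset both indexes so the rows line up when placed side by side
csk = csk.reset_index(drop=True)
srh = srh.reset_index(drop=True)

# Put the two frames next to each other, column-wise
new = pd.concat([srh, csk], axis=1)

new.head()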

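Output:

               PLAYER     TYPE  ...           PLAYER         TYPE
0        David Warner  Batsman  ...         MS Dhoni      Captain
1  Bhuvaneshwar Kumar   Bowler  ...  Ravindra Jadeja  All-Rounder
2       Manish Pandey  Batsman  ...     Suresh Raina  All-Rounder
3   Rashid Khan Arman   Bowler  ...     Kedar Jadhav  All-Rounder
4      Shikhar Dhawan  Batsman  ...     Dwayne Bravo  All-Rounder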

A different approach using concat and drop_duplicates:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)

df1 = pd.read_table(DF1, sep=r'\s+', index_col='id')
df2 = pd.read_table(DF2, sep=r'\s+', index_col='id')
#%%
dictionary = {1:df1,2:df2}
df=pd.concat(dictionary)
df.drop_duplicates(keep=False)

Output:

       Name  score isEnrolled      Comment
  id                                      
1 112  Nick   1.11      False    Graduated
  113   Zoe    NaN       True             
2 112  Nick   1.21      False    Graduated
  113   Zoe    NaN      False  On vacation
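
None of the outputs above is quite the "was X | now Y" HTML table the question asks for. One possible way to get there is sketched below, assuming both frames share the same index and columns. The names `jan1`, `jan2`, `changed`, and `report` are invented for this sketch, and the "was ... | now ..." wording is just copied from the question.

```python
import pandas as pd

# The question's two rosters, indexed by id.
idx = pd.Index([111, 112, 113], name="id")
jan1 = pd.DataFrame({
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.11, 4.12],
    "isEnrolled": [True, False, True],
    "Comment": ["He was late to class", "Graduated", ""],
}, index=idx)
jan2 = pd.DataFrame({
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.21, 4.12],
    "isEnrolled": [True, False, False],
    "Comment": ["He was late to class", "Graduated", "On vacation"],
}, index=idx)

# True where a cell differs; cells that are NaN in both frames count as equal.
changed = (jan1 != jan2) & ~(jan1.isnull() & jan2.isnull())

# Render everything as text, then overwrite changed cells with "was X | now Y".
old = jan1.astype(str)
new = jan2.astype(str)
report = old.where(~changed, "was " + old + " | now " + new)

# Keep only rows with at least one change and render them as an HTML table.
report = report[changed.any(axis=1)]
html = report.to_html()
print(report)
```

Converting to strings loses the original dtypes, which seems acceptable for a report meant only for display.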
