我试图加入两个熊猫数据框架使用两列:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

但得到了以下错误:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4164)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13166)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13120)()

KeyError: '[B_1, c2]'

你知道正确的方法是什么吗?


试试这个

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

left_on:标签或列表,或数组类的字段名,以左侧连接 DataFrame。可以是向量的长度或向量的列表 DataFrame使用特定的向量作为连接键,而不是 列 right_on:标签或列表,或要加入的类似数组的字段名 在右DataFrame或vector/ vector列表中每left_on文档


这里的问题是,通过使用撇号,您正在将传递的值设置为字符串,而实际上,正如文档中@Shijo所述,该函数期望的是标签或列表,而不是字符串!如果列表包含为左右数据框架传递的每个列名,则每个列名必须单独在撇号中。根据上面所说的,我们可以理解为什么这是不正确的:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

下面是正确使用函数的方法:

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])

另一种表达方式:

new_df = A_df.merge(B_df, left_on=['A_c1','c2'], right_on = ['B_c1','c2'], how='left')

你可以用下面的句子,这句话简短易懂:

merged_data= df1.merge(df2, on=["column1","column2"])

这对我来说是可行的,对于n个文件XLS

# all_reports_paths contain one array with all paths per files
for a in all_reports_paths:
    
    df.append( pd.read_excel(a,skiprows=X,skipfooter=X))

df_glob = pd.DataFrame(columns=columns)

for dataframe in df:

    df_glob = pd.concat([df_glob,pd.DataFrame(dataframe)],axis=0)

# finally df_glob contain all data

It merges according to the ordering of left_on and right_on, i.e., the i-th element of left_on will match with the i-th of right_on. In the example below, the code on the top matches A_col1 with B_col1 and A_col2 with B_col2, while the code on the bottom matches A_col1 with B_col2 and A_col2 with B_col1. Evidently, the results are different. As can be seen from the above example, if the merge keys have different names, all keys will show up as their individual columns in the merged dataframe. In the example above, in the top dataframe, A_col1 and B_col1 are identical and A_col2 and B_col2 are identical. In the bottom dataframe, A_col1 and B_col2 are identical and A_col2 and B_col1 are identical. Since these are duplicate columns, they are most likely not needed. One way to not have this problem from the beginning is to make the merge keys identical from the beginning. See bullet point #3 below. If left_on and right_on are the same col1 and col2, we can use on=['col1', 'col2']. In this case, no merge keys are duplicated. df1.merge(df2, on=['col1', 'col2']) You can also merge one side on column names and the other side on index too. For example, in the example below, df1's columns are matched with df2's indices. If the indices are named, as in the example below, you can reference them by name but if not, you can also use right_index=True (or left_index=True if the left dataframe is the one being merged on index). df1.merge(df2, left_on=['A_col1', 'A_col2'], right_index=True) # or df1.merge(df2, left_on=['A_col1', 'A_col2'], right_on=['B_col1', 'B_col2']) By using the how= parameter, you can perform LEFT JOIN (how='left'), FULL OUTER JOIN (how='outer') and RIGHT JOIN (how='right') as well. The default is INNER JOIN (how='inner') as in the examples above. If you have more than 2 dataframes to merge and the merge keys are the same across all of them, then join method is more efficient than merge because you can pass a list of dataframes and join on indices. Note that the index names are the same across all dataframes in the example below (col1 and col2). Note that the indices don't have to have names; if the indices don't have names, then the number of the multi-indices must match (in the case below there are 2 multi-indices). Again, as in bullet point #1, the match occurs according to the ordering of the indices. df1.join([df2, df3], how='inner').reset_index()