I have a data frame with a hierarchical index in axis 1 (columns), produced by a groupby.agg operation:

     USAF   WBAN  year  month  day  s_PC  s_CL  s_CD  s_CNT  tempf       
                                     sum   sum   sum    sum   amax   amin
0  702730  26451  1993      1    1     1     0    12     13  30.92  24.98
1  702730  26451  1993      1    2     0     0    13     13  32.00  24.98
2  702730  26451  1993      1    3     1    10     2     13  23.00   6.98
3  702730  26451  1993      1    4     1     0    12     13  10.04   3.92
4  702730  26451  1993      1    5     3     0    10     13  19.94  10.94

I want to flatten it so that it looks like this (the names aren't critical; I can rename them):

     USAF   WBAN  year  month  day  s_PC  s_CL  s_CD  s_CNT  tempf_amax  tmpf_amin   
0  702730  26451  1993      1    1     1     0    12     13  30.92          24.98
1  702730  26451  1993      1    2     0     0    13     13  32.00          24.98
2  702730  26451  1993      1    3     1    10     2     13  23.00          6.98
3  702730  26451  1993      1    4     1     0    12     13  10.04          3.92
4  702730  26451  1993      1    5     3     0    10     13  19.94          10.94

How do I do this? (I've tried a lot, to no avail.)

Per a suggestion, here is the head in dict form:

{('USAF', ''): {0: '702730',
  1: '702730',
  2: '702730',
  3: '702730',
  4: '702730'},
 ('WBAN', ''): {0: '26451', 1: '26451', 2: '26451', 3: '26451', 4: '26451'},
 ('day', ''): {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
 ('month', ''): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
 ('s_CD', 'sum'): {0: 12.0, 1: 13.0, 2: 2.0, 3: 12.0, 4: 10.0},
 ('s_CL', 'sum'): {0: 0.0, 1: 0.0, 2: 10.0, 3: 0.0, 4: 0.0},
 ('s_CNT', 'sum'): {0: 13.0, 1: 13.0, 2: 13.0, 3: 13.0, 4: 13.0},
 ('s_PC', 'sum'): {0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0, 4: 3.0},
 ('tempf', 'amax'): {0: 30.920000000000002,
  1: 32.0,
  2: 23.0,
  3: 10.039999999999999,
  4: 19.939999999999998},
 ('tempf', 'amin'): {0: 24.98,
  1: 24.98,
  2: 6.9799999999999969,
  3: 3.9199999999999982,
  4: 10.940000000000001},
 ('year', ''): {0: 1993, 1: 1993, 2: 1993, 3: 1993, 4: 1993}}

Current answer

After reading through all the answers, I came up with this:

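import pandas as pd

def __my_flatten_cols(self, how="_".join, reset_index=True):
    # how: a callable that combines one column's level labels, or "last" to keep only the bottom-most label
    how = (lambda labels: list(labels)[-1]) if how == "last" else how
    # stringify the labels and drop empty ones; single-level columns are left untouched
    self.columns = [how(filter(None, map(str, levels))) for levels in self.columns.values] \
                    if isinstance(self.columns, pd.MultiIndex) else self.columns
    return self.reset_index() if reset_index else self
# attach the helper as a DataFrame method (note: it changes self.columns in place)
pd.DataFrame.my_flatten_cols = __my_flatten_cols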

Usage:

Given a data frame:

df = pd.DataFrame({"grouper": ["x","x","y","y"], "val1": [0,2,4,6], 2: [1,3,5,7]}, columns=["grouper", "val1", 2])

  grouper  val1  2
0       x     0  1
1       x     2  3
2       y     4  5
3       y     6  7

Single aggregation method: the resulting variables are named the same as the source:

df.groupby(by="grouper").agg("min").my_flatten_cols()

This is the same as df.groupby(by="grouper", as_index=False) or .agg(...).reset_index().

----- before -----
         val1  2
grouper

------ after -----
  grouper  val1  2
0       x     0  1
1       y     4  5

Single source variable, multiple aggregations: the resulting variables are named after the statistics:

df.groupby(by="grouper").agg({"val1": [min,max]}).my_flatten_cols("last")

This is the same as a = df.groupby(..).agg(..); a.columns = a.columns.droplevel(0); a.reset_index().

----- before -----
        val1
         min max
grouper

------ after -----
  grouper  min  max
0       x    0    2
1       y    4    6

Multiple variables, multiple aggregations: the resulting variables are named (varname)_(statname):

df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols()
# you can combine the names in other ways too, e.g. use a different delimiter:
#df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols(" ".join)

This runs a.columns = ["_".join(filter(None, map(str, levels))) for levels in a.columns.values] under the hood (since this form of agg() results in a MultiIndex on the columns).

If you don't have the my_flatten_cols helper, it might be easier to type in the solution suggested by @Seigi: a.columns = ["_".join(t).rstrip("_") for t in a.columns.values], which works similarly in this case (but fails if you have numeric labels on columns).

To handle numeric labels on columns, you could use the solution suggested by @jxstanford and @Nolan Conaway (a.columns = ["_".join(tuple(map(str, t))).rstrip("_") for t in a.columns.values]), but I don't understand why the tuple() call is needed, and I believe rstrip() is only required if some columns have a descriptor like ("colname", "") (which can happen if you reset_index() before trying to fix up .columns).

----- before -----
        val1   2
         min sum size
grouper

------ after -----
  grouper  val1_min  2_sum  2_size
0       x         0      4       2
1       y         4     12       2

You want to name the resulting variables manually (this is deprecated since pandas 0.20.0 with no adequate alternative as of 0.23):

df.groupby(by="grouper").agg({"val1": {"sum_of_val1": "sum", "count_of_val1": "count"}, 2: {"sum_of_2": "sum", "count_of_2": "count"}}).my_flatten_cols("last")

Other suggestions include setting the columns manually (res.columns = ['A_sum', 'B_sum', 'count']) or .join()ing multiple groupby statements.

----- before -----
                 val1                      2
        count_of_val1 sum_of_val1 count_of_2 sum_of_2
grouper

------ after -----
  grouper  count_of_val1  sum_of_val1  count_of_2  sum_of_2
0       x              2            2           2         4
1       y              2           10           2        12
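
For readers on newer pandas, the manual-naming case can probably be handled with named aggregation, which was added in pandas 0.25 (after this answer was written). The following is only a sketch: it assumes pandas 0.25 or later and, to keep things simple, a second column named "val2" instead of the integer label 2; the output names are the same illustrative ones used above.

```python
import pandas as pd

# Same toy data as above, except the second value column is called "val2"
# (an assumption made for this sketch; the answer's frame uses the label 2).
df = pd.DataFrame(
    {"grouper": ["x", "x", "y", "y"], "val1": [0, 2, 4, 6], "val2": [1, 3, 5, 7]}
)

# Named aggregation: output_name=(source column, aggregation).
# The result comes back with flat, single-level columns.
res = df.groupby("grouper", as_index=False).agg(
    sum_of_val1=("val1", "sum"),
    count_of_val1=("val1", "count"),
    sum_of_val2=("val2", "sum"),
    count_of_val2=("val2", "count"),
)
print(res)
```

Because the names are chosen up front, no flattening step is needed afterwards.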

Cases handled by the helper function

Level names can be non-string (e.g. "Index pandas DataFrame by column numbers, when column names are integers"), so we have to convert with map(str, ..).

They can also be empty, so we have to filter(None, ..).

For single-level columns (i.e. anything except MultiIndex), columns.values returns the names (str, not tuples).

Depending on how you used .agg(), you may need to keep the bottom-most label for a column or concatenate multiple labels.

(Since I'm new to pandas?) More often than not, I want reset_index() so that I can work with the group-by columns in the regular way, so it does that by default.
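
The cases listed above can also be covered without patching pd.DataFrame. The sketch below is one possible variant, not the author's code: flatten_cols and its sep parameter are names invented here, and the demo frame is built by hand so that it contains an empty second-level label and an integer label.

```python
import pandas as pd

def flatten_cols(df, sep="_"):
    """Return a copy of df whose MultiIndex columns are joined into strings.

    Hypothetical helper: labels are converted with str(), empty labels are
    dropped, and frames with single-level columns are returned unchanged.
    """
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        out.columns = [sep.join(filter(None, map(str, levels))) for levels in out.columns]
    return out

# Hand-built columns: an empty second level and an integer top-level label.
cols = pd.MultiIndex.from_tuples(
    [("grouper", ""), ("val1", "min"), (2, "sum"), (2, "size")]
)
agg = pd.DataFrame([["x", 0, 4, 2], ["y", 4, 12, 2]], columns=cols)

flat = flatten_cols(agg)
print(list(flat.columns))
```

Unlike the patched method, this version leaves the input frame untouched, which makes it usable with .pipe().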

Other answers

To flatten a MultiIndex inside a chain of other DataFrame methods, define a function like this:

def flatten_index(df):
  df_copy = df.copy()
  df_copy.columns = ['_'.join(col).rstrip('_') for col in df_copy.columns.values]
  return df_copy.reset_index()

Then use the pipe method to apply this function in a chain of DataFrame methods, after groupby and agg but before any other methods in the chain:

my_df \
  .groupby('group') \
  .agg({'value': ['count']}) \
  .pipe(flatten_index) \
  .sort_values('value_count')
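
To see the chain run end to end, here is a small self-contained sketch. The contents of my_df are made up for illustration; only the column names 'group' and 'value' come from the snippet above.

```python
import pandas as pd

def flatten_index(df):
    # Join each (column, statistic) pair with "_" and move the group key
    # back into a regular column.
    df_copy = df.copy()
    df_copy.columns = ['_'.join(col).rstrip('_') for col in df_copy.columns.values]
    return df_copy.reset_index()

# Hypothetical input data.
my_df = pd.DataFrame(
    {"group": ["a", "a", "a", "b", "c", "c"], "value": [1, 2, 3, 4, 5, 6]}
)

result = (
    my_df
    .groupby('group')
    .agg({'value': ['count']})
    .pipe(flatten_index)          # columns become ['group', 'value_count']
    .sort_values('value_count')   # works because the name is now a plain string
)
print(result)
```

Note that this join assumes every column label is a string; integer labels would need str() first.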

A bit late maybe, but if you are not worried about duplicate column names:

df.columns = df.columns.tolist()

I think the easiest way to do this is to set the columns to the top level:

df.columns = df.columns.get_level_values(0)

Note: if the top level has a name, you can also access it by that name rather than 0.
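
One thing to watch for with the top-level approach is that columns sharing a top-level label end up with identical names. A small sketch, using a hand-built frame modelled on the question's columns (the level names "variable" and "stat" are invented for this example):

```python
import pandas as pd

# Hypothetical two-level columns modelled on the question's frame.
cols = pd.MultiIndex.from_tuples(
    [("USAF", ""), ("s_PC", "sum"), ("tempf", "amax"), ("tempf", "amin")],
    names=["variable", "stat"],  # level names are made up for this example
)
df = pd.DataFrame([["702730", 1.0, 30.92, 24.98]], columns=cols)

# Selecting the level by position and by name gives the same labels.
by_position = df.columns.get_level_values(0)
by_name = df.columns.get_level_values("variable")

df.columns = by_position
print(list(df.columns))      # ['USAF', 's_PC', 'tempf', 'tempf']
print(df.columns.is_unique)  # False: both tempf columns now share a label
```

For the frame in the question this means amax and amin can no longer be told apart by name, so joining the levels is probably the better fit there.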

If you want to combine/join your MultiIndex into a single Index (assuming your columns contain only string entries), you could:

df.columns = [' '.join(col).strip() for col in df.columns.values]

Note: we have to strip the whitespace for the columns that have no second-level label.

In [11]: [' '.join(col).strip() for col in df.columns.values]
Out[11]: 
['USAF',
 'WBAN',
 'day',
 'month',
 's_CD sum',
 's_CL sum',
 's_CNT sum',
 's_PC sum',
 'tempf amax',
 'tempf amin',
 'year']

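All of the current answers on this thread must have been a bit dated. As of pandas version 0.24.0, .to_flat_index() does what you need.

From pandas's own documentation:

MultiIndex.to_flat_index() Convert a MultiIndex to an Index of Tuples containing the level values.

A simple example from its documentation: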

import pandas as pd
print(pd.__version__) # to_flat_index() requires 0.24.0 or later
index = pd.MultiIndex.from_product(
        [['foo', 'bar'], ['baz', 'qux']],
        names=['a', 'b'])

print(index)
# MultiIndex(levels=[['bar', 'foo'], ['baz', 'qux']],
#           codes=[[1, 1, 0, 0], [0, 1, 0, 1]],
#           names=['a', 'b'])

Applying to_flat_index():

index.to_flat_index()
# Index([('foo', 'baz'), ('foo', 'qux'), ('bar', 'baz'), ('bar', 'qux')], dtype='object')

Using it to replace an existing pandas column index

An example of how you'd use it on dat, which is a DataFrame with MultiIndex columns:

dat = df.loc[:,['name','workshop_period','class_size']].groupby(['name','workshop_period']).describe()
print(dat.columns)
# MultiIndex(levels=[['class_size'], ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']],
#            codes=[[0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6, 7]])

dat.columns = dat.columns.to_flat_index()
print(dat.columns)
# Index([('class_size', 'count'),  ('class_size', 'mean'),
#     ('class_size', 'std'),   ('class_size', 'min'),
#     ('class_size', '25%'),   ('class_size', '50%'),
#     ('class_size', '75%'),   ('class_size', 'max')],
#  dtype='object')

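Flattening and renaming in place

It may be worth noting how you can combine that with a simple list comprehension (thanks @Skippy and @mmann1123) to join the elements, so that the resulting column names are plain strings separated by, for example, underscores: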

dat.columns = ["_".join(a) for a in dat.columns.to_flat_index()]
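
The plain "_".join(a) above assumes every level value is a non-empty string. If some labels are integers or empty (as in the group-by examples elsewhere in this thread), a slightly more defensive variant might look like the sketch below; the frame is built by hand for illustration, and pandas 0.24.0 or later is assumed.

```python
import pandas as pd

# Hand-built frame with an empty second-level label and an integer label.
cols = pd.MultiIndex.from_tuples([("grouper", ""), ("val1", "min"), (2, "sum")])
dat = pd.DataFrame([["x", 0, 4]], columns=cols)

# Convert each part to str, drop empty parts, then join with "_".
dat.columns = [
    "_".join(filter(None, map(str, col))) for col in dat.columns.to_flat_index()
]
print(list(dat.columns))  # ['grouper', 'val1_min', '2_sum']
```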