从数据框架中删除重复列的最简单方法是什么?
我正在阅读一个文本文件,通过重复的列:
import pandas as pd
df=pd.read_table(fname)
列名为:
Time, Time Relative, N2, Time, Time Relative, H2, etc...
所有“时间”和“时间相对”列包含相同的数据。我想要:
Time, Time Relative, N2, H2
我所有的尝试删除,删除等,如:
df=df.T.drop_duplicates().T
导致唯一值的索引错误:
Reindexing only valid with uniquely valued index objects
对不起,我是熊猫的菜鸟。任何建议将不胜感激。
额外的细节
熊猫版本:0.9.0
Python版本:2.7.3
Windows 7
(通过Pythonxy 2.7.3.0安装)
数据文件(注:在实际文件中,列之间以制表符分隔,此处以4个空格分隔):
Time Time Relative [s] N2[%] Time Time Relative [s] H2[ppm]
2/12/2013 9:20:55 AM 6.177 9.99268e+001 2/12/2013 9:20:55 AM 6.177 3.216293e-005
2/12/2013 9:21:06 AM 17.689 9.99296e+001 2/12/2013 9:21:06 AM 17.689 3.841667e-005
2/12/2013 9:21:18 AM 29.186 9.992954e+001 2/12/2013 9:21:18 AM 29.186 3.880365e-005
... etc ...
2/12/2013 2:12:44 PM 17515.269 9.991756+001 2/12/2013 2:12:44 PM 17515.269 2.800279e-005
2/12/2013 2:12:55 PM 17526.769 9.991754e+001 2/12/2013 2:12:55 PM 17526.769 2.880386e-005
2/12/2013 2:13:07 PM 17538.273 9.991797e+001 2/12/2013 2:13:07 PM 17538.273 3.131447e-005
简单的列比较是按值检查重复列的最有效方法(就内存和时间而言)。这里有一个例子:
import numpy as np
import pandas as pd
from itertools import combinations as combi
df = pd.DataFrame(np.random.uniform(0,1, (100,4)), columns=['a','b','c','d'])
df['a'] = df['d'].copy() # column 'a' is equal to column 'd'
# to keep the first
dupli_cols = [cc[1] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
# to keep the last
dupli_cols = [cc[0] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
df = df.drop(columns=dupli_cols)
以防有人还在寻找如何在Python中为Pandas数据帧的列中寻找重复值的答案,我想出了这个解决方案:
def get_dup_columns(m):
'''
This will check every column in data frame
and verify if you have duplicated columns.
can help whenever you are cleaning big data sets of 50+ columns
and clean up a little bit for you
The result will be a list of tuples showing what columns are duplicates
for example
(column A, Column C)
That means that column A is duplicated with column C
more info go to https://wanatux.com
'''
headers_list = [x for x in m.columns]
duplicate_col2 = []
y = 0
while y <= len(headers_list)-1:
for x in range(1,len(headers_list)-1):
if m[headers_list[y]].equals(m[headers_list[x]]) == False:
continue
else:
duplicate_col2.append((headers_list[y],headers_list[x]))
headers_list.pop(0)
return duplicate_col2
你可以像这样强制转换定义:
duplicate_col = get_dup_columns(pd_excel)
它将显示如下结果:
[('column a', 'column k'),
('column a', 'column r'),
('column h', 'column m'),
('column k', 'column r')]