What is the easiest way to remove duplicate columns from a DataFrame?
I am reading a text file that has duplicate columns via:
import pandas as pd
df=pd.read_table(fname)
The column names are:
Time, Time Relative, N2, Time, Time Relative, H2, etc...
All of the Time and Time Relative columns contain the same data. I want:
Time, Time Relative, N2, H2
All my attempts at dropping, deleting, etc., such as:
df=df.T.drop_duplicates().T
result in uniquely valued index errors:
Reindexing only valid with uniquely valued index objects
Sorry for being a pandas noob. Any suggestions would be appreciated.
Additional details
pandas version: 0.9.0
Python version: 2.7.3
Windows 7
(installed via Pythonxy 2.7.3.0)
Data file (note: in the real file the columns are separated by tabs; here they are separated by 4 spaces):
Time Time Relative [s] N2[%] Time Time Relative [s] H2[ppm]
2/12/2013 9:20:55 AM 6.177 9.99268e+001 2/12/2013 9:20:55 AM 6.177 3.216293e-005
2/12/2013 9:21:06 AM 17.689 9.99296e+001 2/12/2013 9:21:06 AM 17.689 3.841667e-005
2/12/2013 9:21:18 AM 29.186 9.992954e+001 2/12/2013 9:21:18 AM 29.186 3.880365e-005
... etc ...
2/12/2013 2:12:44 PM 17515.269 9.991756e+001 2/12/2013 2:12:44 PM 17515.269 2.800279e-005
2/12/2013 2:12:55 PM 17526.769 9.991754e+001 2/12/2013 2:12:55 PM 17526.769 2.880386e-005
2/12/2013 2:13:07 PM 17538.273 9.991797e+001 2/12/2013 2:13:07 PM 17538.273 3.131447e-005
Transposing is inefficient for large DataFrames. Here is an alternative:
def duplicate_columns(frame):
    # Group column names by dtype so we only compare columns that could be equal.
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        dcols = frame[v].to_dict(orient="list")
        vs = list(dcols.values())
        ks = list(dcols.keys())
        lvs = len(vs)
        for i in range(lvs):
            for j in range(i + 1, lvs):
                if vs[i] == vs[j]:
                    # Column i duplicates a later column; mark it for dropping.
                    dups.append(ks[i])
                    break
    return dups
Use it like this:
dups = duplicate_columns(frame)
frame = frame.drop(dups, axis=1)
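For example, here is a minimal sketch on made-up data (the column names and values are only for illustration, not from the question's file); the earlier of the two identical columns is reported and dropped:

import pandas as pd

# Hypothetical data: "Time2" is an exact copy of "Time".
frame = pd.DataFrame({
    "Time": [6.177, 17.689, 29.186],
    "Time2": [6.177, 17.689, 29.186],
    "N2": [99.9268, 99.9296, 99.92954],
    "H2": [3.2e-05, 3.8e-05, 3.9e-05],
})

dups = duplicate_columns(frame)   # ['Time'] -- the earlier of the two copies
frame = frame.drop(dups, axis=1)  # leaves Time2, N2, H2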
Edit
A memory-efficient version that treats NaNs like any other value:
# array_equivalent compares two arrays element-wise and treats NaNs in the
# same positions as equal. In older pandas it lives in pandas.core.common;
# newer versions moved it to pandas.core.dtypes.missing.
try:
    from pandas.core.dtypes.missing import array_equivalent
except ImportError:
    from pandas.core.common import array_equivalent

def duplicate_columns(frame):
    # Group column names by dtype so we only compare columns that could be equal.
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)
        for i in range(lcs):
            ia = vs.iloc[:, i].values
            for j in range(i + 1, lcs):
                ja = vs.iloc[:, j].values
                if array_equivalent(ia, ja):
                    # Column i duplicates a later column; mark it for dropping.
                    dups.append(cs[i])
                    break
    return dups
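A minimal usage sketch with made-up data containing NaN (again, the column names are only for illustration); because array_equivalent treats NaN positions as matching, the duplicated column is still detected:

import numpy as np
import pandas as pd

# Hypothetical data: "H2_copy" duplicates "H2", NaNs included.
frame = pd.DataFrame({
    "H2": [3.2e-05, np.nan, 3.9e-05],
    "H2_copy": [3.2e-05, np.nan, 3.9e-05],
    "N2": [99.93, 99.93, np.nan],
})

dups = duplicate_columns(frame)   # ['H2']
frame = frame.drop(dups, axis=1)  # leaves H2_copy, N2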