从数据框架中删除重复列的最简单方法是什么?
我正在阅读一个文本文件,通过重复的列:
import pandas as pd
df=pd.read_table(fname)
列名为:
Time, Time Relative, N2, Time, Time Relative, H2, etc...
所有“时间”和“时间相对”列包含相同的数据。我想要:
Time, Time Relative, N2, H2
我所有的尝试删除,删除等,如:
df=df.T.drop_duplicates().T
导致唯一值的索引错误:
Reindexing only valid with uniquely valued index objects
对不起,我是熊猫的菜鸟。任何建议将不胜感激。
额外的细节
熊猫版本:0.9.0
Python版本:2.7.3
Windows 7
(通过Pythonxy 2.7.3.0安装)
数据文件(注:在实际文件中,列之间以制表符分隔,此处以4个空格分隔):
Time Time Relative [s] N2[%] Time Time Relative [s] H2[ppm]
2/12/2013 9:20:55 AM 6.177 9.99268e+001 2/12/2013 9:20:55 AM 6.177 3.216293e-005
2/12/2013 9:21:06 AM 17.689 9.99296e+001 2/12/2013 9:21:06 AM 17.689 3.841667e-005
2/12/2013 9:21:18 AM 29.186 9.992954e+001 2/12/2013 9:21:18 AM 29.186 3.880365e-005
... etc ...
2/12/2013 2:12:44 PM 17515.269 9.991756+001 2/12/2013 2:12:44 PM 17515.269 2.800279e-005
2/12/2013 2:12:55 PM 17526.769 9.991754e+001 2/12/2013 2:12:55 PM 17526.769 2.880386e-005
2/12/2013 2:13:07 PM 17538.273 9.991797e+001 2/12/2013 2:13:07 PM 17538.273 3.131447e-005
以防有人还在寻找如何在Python中为Pandas数据帧的列中寻找重复值的答案,我想出了这个解决方案:
def get_dup_columns(m):
'''
This will check every column in data frame
and verify if you have duplicated columns.
can help whenever you are cleaning big data sets of 50+ columns
and clean up a little bit for you
The result will be a list of tuples showing what columns are duplicates
for example
(column A, Column C)
That means that column A is duplicated with column C
more info go to https://wanatux.com
'''
headers_list = [x for x in m.columns]
duplicate_col2 = []
y = 0
while y <= len(headers_list)-1:
for x in range(1,len(headers_list)-1):
if m[headers_list[y]].equals(m[headers_list[x]]) == False:
continue
else:
duplicate_col2.append((headers_list[y],headers_list[x]))
headers_list.pop(0)
return duplicate_col2
你可以像这样强制转换定义:
duplicate_col = get_dup_columns(pd_excel)
它将显示如下结果:
[('column a', 'column k'),
('column a', 'column r'),
('column h', 'column m'),
('column k', 'column r')]
听起来好像您已经知道了唯一的列名。如果是这样的话,那么df = df['时间','时间相对','N2']将有效。
如果不是,你的解决方案应该工作:
In [101]: vals = np.random.randint(0,20, (4,3))
vals
Out[101]:
array([[ 3, 13, 0],
[ 1, 15, 14],
[14, 19, 14],
[19, 5, 1]])
In [106]: df = pd.DataFrame(np.hstack([vals, vals]), columns=['Time', 'H1', 'N2', 'Time Relative', 'N2', 'Time'] )
df
Out[106]:
Time H1 N2 Time Relative N2 Time
0 3 13 0 3 13 0
1 1 15 14 1 15 14
2 14 19 14 14 19 14
3 19 5 1 19 5 1
In [107]: df.T.drop_duplicates().T
Out[107]:
Time H1 N2
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
您可能有一些特定于您的数据的东西搞砸了。如果你能提供更多关于数据的细节,我们会给予更多的帮助。
编辑:
如Andy所说,问题可能在于重复的列标题。
对于一个示例表文件'dummy.csv',我创建了:
Time H1 N2 Time N2 Time Relative
3 13 13 3 13 0
1 15 15 1 15 14
14 19 19 14 19 14
19 5 5 19 5 1
使用read_table提供唯一的列并正常工作:
In [151]: df2 = pd.read_table('dummy.csv')
df2
Out[151]:
Time H1 N2 Time.1 N2.1 Time Relative
0 3 13 13 3 13 0
1 1 15 15 1 15 14
2 14 19 19 14 19 14
3 19 5 5 19 5 1
In [152]: df2.T.drop_duplicates().T
Out[152]:
Time H1 Time Relative
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
如果你的版本不允许,你可以拼凑一个解决方案,使它们独一无二:
In [169]: df2 = pd.read_table('dummy.csv', header=None)
df2
Out[169]:
0 1 2 3 4 5
0 Time H1 N2 Time N2 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [171]: from collections import defaultdict
col_counts = defaultdict(int)
col_ix = df2.first_valid_index()
In [172]: cols = []
for col in df2.ix[col_ix]:
cnt = col_counts[col]
col_counts[col] += 1
suf = '_' + str(cnt) if cnt else ''
cols.append(col + suf)
cols
Out[172]:
['Time', 'H1', 'N2', 'Time_1', 'N2_1', 'Time Relative']
In [174]: df2.columns = cols
df2 = df2.drop([col_ix])
In [177]: df2
Out[177]:
Time H1 N2 Time_1 N2_1 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [178]: df2.T.drop_duplicates().T
Out[178]:
Time H1 Time Relative
1 3 13 0
2 1 15 14
3 14 19 14
4 19 5 1