我有一个数据框架形式的相当大的数据集,我想知道我如何能够将数据框架分成两个随机样本(80%和20%)进行训练和测试。

谢谢!


当前回答

像这样从df中选择range row

row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]

其他回答

如果你需要根据你的数据集中的lables列来分割你的数据,你可以使用这个:

def split_to_train_test(df, label_column, train_frac=0.8):
    train_df, test_df = pd.DataFrame(), pd.DataFrame()
    labels = df[label_column].unique()
    for lbl in labels:
        lbl_df = df[df[label_column] == lbl]
        lbl_train_df = lbl_df.sample(frac=train_frac)
        lbl_test_df = lbl_df.drop(lbl_train_df.index)
        print '\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df))
        train_df = train_df.append(lbl_train_df)
        test_df = test_df.append(lbl_test_df)

    return train_df, test_df

并使用它:

train, test = split_to_train_test(data, 'class', 0.7)

如果你想控制分割随机性或使用一些全局随机种子,你也可以传递random_state。

熊猫随机抽样也可以

train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)

对于相同的random_state值,您将始终在训练集和测试集中获得相同的确切数据。这带来了一定程度的可重复性,同时还随机分离训练和测试数据。

对我来说,更优雅一点的方法是创建一个随机列,然后按它进行分割,这样我们就可以得到一个符合我们需求的随机分割。

def split_df(df, p=[0.8, 0.2]):
import numpy as np
df["rand"]=np.random.choice(len(p), len(df), p=p)
r = [df[df["rand"]==val] for val in df["rand"].unique()]
return r

我会使用numpy的randn:

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

为了证明这是有效的:

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79

你也可以考虑分层划分为训练集和测试集。设定划分也随机生成训练集和测试集,但保留了原始的类比例。这使得训练集和测试集更好地反映原始数据集的属性。

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

df[train_inds]和df[test_inds]为您提供原始DataFrame df的训练和测试集。