我有一个数据框架形式的相当大的数据集,我想知道我如何能够将数据框架分成两个随机样本(80%和20%)进行训练和测试。

谢谢!


当前回答

这是我在需要分割数据帧时所写的。我考虑过使用上面安迪的方法,但不喜欢我不能精确地控制数据集的大小(例如,有时是79,有时是81,等等)。

def make_sets(data_df, test_portion):
    import random as rnd

    tot_ix = range(len(data_df))
    test_ix = sort(rnd.sample(tot_ix, int(test_portion * len(data_df))))
    train_ix = list(set(tot_ix) ^ set(test_ix))

    test_df = data_df.ix[test_ix]
    train_df = data_df.ix[train_ix]

    return train_df, test_df


train_df, test_df = make_sets(data_df, 0.2)
test_df.head()

其他回答

我会使用numpy的randn:

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

为了证明这是有效的:

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79

熊猫随机抽样也可以

train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)

对于相同的random_state值,您将始终在训练集和测试集中获得相同的确切数据。这带来了一定程度的可重复性,同时还随机分离训练和测试数据。

对我来说,更优雅一点的方法是创建一个随机列,然后按它进行分割,这样我们就可以得到一个符合我们需求的随机分割。

def split_df(df, p=[0.8, 0.2]):
import numpy as np
df["rand"]=np.random.choice(len(p), len(df), p=p)
r = [df[df["rand"]==val] for val in df["rand"].unique()]
return r

有许多方法可以创建训练/测试甚至验证样本。

案例1:没有任何选项的经典方法train_test_split:

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)

案例2:非常小的数据集(<500行):为了通过这种交叉验证获得所有行的结果。最后,您将对可用训练集的每一行都有一个预测。

from sklearn.model_selection import KFold
kf = KFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)

案例3a:用于分类的不平衡数据集。下面是情形1的等价解:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)

案例3b:用于分类的不平衡数据集。在情形2之后,等价解如下:

from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)

案例4:你需要在大数据上创建一个训练/测试/验证集来调优超参数(60%训练,20%测试和20% val)。

from sklearn.model_selection import train_test_split
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.6)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y, test_size=0.5)

有很多有效的答案。又多了一个。 从sklearn。交叉验证导入train_test_split

#gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
#gets the left out portion of the dataset
X_test = X.loc[~df_model.index.isin(X_train.index)]