我有一个数据框架形式的相当大的数据集,我想知道我如何能够将数据框架分成两个随机样本(80%和20%)进行训练和测试。
谢谢!
我有一个数据框架形式的相当大的数据集,我想知道我如何能够将数据框架分成两个随机样本(80%和20%)进行训练和测试。
谢谢!
当前回答
我将使用scikit-learn自己的training_test_split,并从索引生成它
from sklearn.model_selection import train_test_split
y = df.pop('output')
X = df
X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.iloc[X_train] # return dataframe train
其他回答
如果你想把它分成训练集、测试集和验证集,你可以使用这个函数:
from sklearn.model_selection import train_test_split
import pandas as pd
def train_test_val_split(df, test_size=0.15, val_size=0.45):
temp, test = train_test_split(df, test_size=test_size)
total_items_count = len(df.index)
val_length = total_items_count * val_size
new_val_propotion = val_length / len(temp.index)
train, val = train_test_split(temp, test_size=new_val_propotion)
return train, test, val
对我来说,更优雅一点的方法是创建一个随机列,然后按它进行分割,这样我们就可以得到一个符合我们需求的随机分割。
def split_df(df, p=[0.8, 0.2]):
import numpy as np
df["rand"]=np.random.choice(len(p), len(df), p=p)
r = [df[df["rand"]==val] for val in df["rand"].unique()]
return r
Scikit Learn的train_test_split就是一个很好的例子。它将拆分numpy数组和数据框架。
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
我认为你还需要一个副本,而不是一个切片的数据框架,如果你想以后添加列。
msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)
有许多方法可以创建训练/测试甚至验证样本。
案例1:没有任何选项的经典方法train_test_split:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)
案例2:非常小的数据集(<500行):为了通过这种交叉验证获得所有行的结果。最后,您将对可用训练集的每一行都有一个预测。
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
reg = RandomForestRegressor(n_estimators=50, random_state=0)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = reg.fit(X_train, y_train)
y_hat = clf.predict(X_test)
y_hat_all.append(y_hat)
案例3a:用于分类的不平衡数据集。下面是情形1的等价解:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)
案例3b:用于分类的不平衡数据集。在情形2之后,等价解如下:
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
reg = RandomForestRegressor(n_estimators=50, random_state=0)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = reg.fit(X_train, y_train)
y_hat = clf.predict(X_test)
y_hat_all.append(y_hat)
案例4:你需要在大数据上创建一个训练/测试/验证集来调优超参数(60%训练,20%测试和20% val)。
from sklearn.model_selection import train_test_split
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.6)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y, test_size=0.5)