我有一个数据框架形式的相当大的数据集,我想知道我如何能够将数据框架分成两个随机样本(80%和20%)进行训练和测试。

谢谢!


当前回答

如果你想把它分成训练集、测试集和验证集,你可以使用这个函数:

from sklearn.model_selection import train_test_split
import pandas as pd

def train_test_val_split(df, test_size=0.15, val_size=0.45):
    temp, test = train_test_split(df, test_size=test_size)
    total_items_count = len(df.index)
    val_length = total_items_count * val_size
    new_val_propotion = val_length / len(temp.index) 
    train, val = train_test_split(temp, test_size=new_val_propotion)
    return train, test, val

其他回答

上面有很多很好的答案,所以我只想再加一个例子,在这种情况下,你想通过使用numpy库来指定火车和测试集的确切样本数量。

# set the random seed for the reproducibility
np.random.seed(17)

# e.g. number of samples for the training set is 1000
n_train = 1000

# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)

# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]

train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]

test_data = data_df.iloc[test_ids]
test_labels = data_df.iloc[test_ids]

我将使用scikit-learn自己的training_test_split,并从索引生成它

from sklearn.model_selection import train_test_split


y = df.pop('output')
X = df

X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.iloc[X_train] # return dataframe train

我会用K-fold交叉验证。 它已被证明比train_test_split提供更好的结果。下面是一篇关于如何在sklearn中应用它的文章,来自文档本身:https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

如果你想把它分成训练集、测试集和验证集,你可以使用这个函数:

from sklearn.model_selection import train_test_split
import pandas as pd

def train_test_val_split(df, test_size=0.15, val_size=0.45):
    temp, test = train_test_split(df, test_size=test_size)
    total_items_count = len(df.index)
    val_length = total_items_count * val_size
    new_val_propotion = val_length / len(temp.index) 
    train, val = train_test_split(temp, test_size=new_val_propotion)
    return train, test, val

这是我在需要分割数据帧时所写的。我考虑过使用上面安迪的方法,但不喜欢我不能精确地控制数据集的大小(例如,有时是79,有时是81,等等)。

def make_sets(data_df, test_portion):
    import random as rnd

    tot_ix = range(len(data_df))
    test_ix = sort(rnd.sample(tot_ix, int(test_portion * len(data_df))))
    train_ix = list(set(tot_ix) ^ set(test_ix))

    test_df = data_df.ix[test_ix]
    train_df = data_df.ix[train_ix]

    return train_df, test_df


train_df, test_df = make_sets(data_df, 0.2)
test_df.head()