我有一个数据框架形式的相当大的数据集,我想知道我如何能够将数据框架分成两个随机样本(80%和20%)进行训练和测试。

谢谢!


当前回答

你也可以考虑分层划分为训练集和测试集。设定划分也随机生成训练集和测试集,但保留了原始的类比例。这使得训练集和测试集更好地反映原始数据集的属性。

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

df[train_inds]和df[test_inds]为您提供原始DataFrame df的训练和测试集。

其他回答

您可以使用df.as_matrix()函数并创建Numpy-array并传递它。

Y = df.pop()
X = df.as_matrix()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
model.fit(x_train, y_train)
model.test(x_test)

如果你想把它分成训练集、测试集和验证集,你可以使用这个函数:

from sklearn.model_selection import train_test_split
import pandas as pd

def train_test_val_split(df, test_size=0.15, val_size=0.45):
    temp, test = train_test_split(df, test_size=test_size)
    total_items_count = len(df.index)
    val_length = total_items_count * val_size
    new_val_propotion = val_length / len(temp.index) 
    train, val = train_test_split(temp, test_size=new_val_propotion)
    return train, test, val

Scikit Learn的train_test_split就是一个很好的例子。它将拆分numpy数组和数据框架。

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

上面有很多很好的答案,所以我只想再加一个例子,在这种情况下,你想通过使用numpy库来指定火车和测试集的确切样本数量。

# set the random seed for the reproducibility
np.random.seed(17)

# e.g. number of samples for the training set is 1000
n_train = 1000

# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)

# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]

train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]

test_data = data_df.iloc[test_ids]
test_labels = data_df.iloc[test_ids]

如果你希望有一个数据帧和两个数据帧(不是numpy数组),这应该可以做到:

def split_data(df, train_perc = 0.8):

   df['train'] = np.random.rand(len(df)) < train_perc

   train = df[df.train == 1]

   test = df[df.train == 0]

   split_data ={'train': train, 'test': test}

   return split_data