我有一个数据框架形式的相当大的数据集,我想知道我如何能够将数据框架分成两个随机样本(80%和20%)进行训练和测试。

谢谢!


当前回答

你也可以考虑分层划分为训练集和测试集。设定划分也随机生成训练集和测试集,但保留了原始的类比例。这使得训练集和测试集更好地反映原始数据集的属性。

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

df[train_inds]和df[test_inds]为您提供原始DataFrame df的训练和测试集。

其他回答

不需要转换为numpy。只要用pandas df来做拆分,它就会返回一个pandas df。

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

如果你想把x和y分开

X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col],test_size=0.2)

如果要分割整个df

X, y = df[list_of_x_cols], df[y_col]

你也可以考虑分层划分为训练集和测试集。设定划分也随机生成训练集和测试集,但保留了原始的类比例。这使得训练集和测试集更好地反映原始数据集的属性。

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

df[train_inds]和df[test_inds]为您提供原始DataFrame df的训练和测试集。

Scikit Learn的train_test_split就是一个很好的例子。它将拆分numpy数组和数据框架。

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

您可以使用df.as_matrix()函数并创建Numpy-array并传递它。

Y = df.pop()
X = df.as_matrix()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
model.fit(x_train, y_train)
model.test(x_test)

这个怎么样? Df是我的数据框架

total_size=len(df)

train_size=math.floor(0.66*total_size) (2/3 part of my dataset)

#training dataset
train=df.head(train_size)
#test dataset
test=df.tail(len(df) -train_size)