我有一个数据框架形式的相当大的数据集,我想知道我如何能够将数据框架分成两个随机样本(80%和20%)进行训练和测试。
谢谢!
我有一个数据框架形式的相当大的数据集,我想知道我如何能够将数据框架分成两个随机样本(80%和20%)进行训练和测试。
谢谢!
当前回答
上面有很多很好的答案,所以我只想再加一个例子,在这种情况下,你想通过使用numpy库来指定火车和测试集的确切样本数量。
# set the random seed for the reproducibility
np.random.seed(17)
# e.g. number of samples for the training set is 1000
n_train = 1000
# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)
# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]
train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]
test_data = data_df.iloc[test_ids]
test_labels = data_df.iloc[test_ids]
其他回答
将df分成训练,验证,测试。给定增广数据的df,只选择相关列和独立列。将最近的10%的行(使用'dates'列)分配给test_df。随机将剩余行的10%分配给validate_df,其余的分配给train_df。不要重新索引。检查所有行是否都是唯一分配的。只使用本地蟒和熊猫库。
方法1:将行分割为训练、验证、测试数据框架。
train_df = augmented_df[dependent_and_independent_columns]
test_df = train_df.sort_values('dates').tail(int(len(augmented_df)*0.1)) # select latest 10% of dates for test data
train_df = train_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_df.sample(frac=0.1) # randomly assign 10%
train_df = train_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df
方法2:当validate必须是train的子集时拆分行(fastai)
train_validate_test_df = augmented_df[dependent_and_independent_columns]
test_df = train_validate_test_df.loc[augmented_df.sort_values('dates').tail(int(len(augmented_df)*0.1)).index] # select latest 10% of dates for test data
train_validate_df = train_validate_test_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_validate_df.sample(frac=validate_ratio) # assign 10% to validate_df
train_df = train_validate_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df
# fastai example usage
dls = fastai.tabular.all.TabularDataLoaders.from_df(
train_validate_df, valid_idx=train_validate_df.index.get_indexer_for(validate_df.index))
我将使用scikit-learn自己的training_test_split,并从索引生成它
from sklearn.model_selection import train_test_split
y = df.pop('output')
X = df
X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.iloc[X_train] # return dataframe train
如果你希望有一个数据帧和两个数据帧(不是numpy数组),这应该可以做到:
def split_data(df, train_perc = 0.8):
df['train'] = np.random.rand(len(df)) < train_perc
train = df[df.train == 1]
test = df[df.train == 0]
split_data ={'train': train, 'test': test}
return split_data
示例方法选择数据的一部分,您可以先通过传递种子值来打乱数据。
train = df.sample(frac=0.8, random_state=42)
对于测试集,您可以删除通过train DF索引的行,然后重置新DF的索引。
test = df.drop(train_data.index).reset_index(drop=True)
你也可以考虑分层划分为训练集和测试集。设定划分也随机生成训练集和测试集,但保留了原始的类比例。这使得训练集和测试集更好地反映原始数据集的属性。
import numpy as np
def get_train_test_inds(y,train_proportion=0.7):
'''Generates indices, making random stratified split into training set and testing sets
with proportions train_proportion and (1-train_proportion) of initial sample.
y is any iterable indicating classes of each observation in the sample.
Initial proportions of classes inside training and
testing sets are preserved (stratified sampling).
'''
y=np.array(y)
train_inds = np.zeros(len(y),dtype=bool)
test_inds = np.zeros(len(y),dtype=bool)
values = np.unique(y)
for value in values:
value_inds = np.nonzero(y==value)[0]
np.random.shuffle(value_inds)
n = int(train_proportion*len(value_inds))
train_inds[value_inds[:n]]=True
test_inds[value_inds[n:]]=True
return train_inds,test_inds
df[train_inds]和df[test_inds]为您提供原始DataFrame df的训练和测试集。