我有一个数据框架形式的相当大的数据集,我想知道我如何能够将数据框架分成两个随机样本(80%和20%)进行训练和测试。
谢谢!
我有一个数据框架形式的相当大的数据集,我想知道我如何能够将数据框架分成两个随机样本(80%和20%)进行训练和测试。
谢谢!
当前回答
Scikit Learn的train_test_split就是一个很好的例子。它将拆分numpy数组和数据框架。
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
其他回答
上面有很多很好的答案,所以我只想再加一个例子,在这种情况下,你想通过使用numpy库来指定火车和测试集的确切样本数量。
# set the random seed for the reproducibility
np.random.seed(17)
# e.g. number of samples for the training set is 1000
n_train = 1000
# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)
# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]
train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]
test_data = data_df.iloc[test_ids]
test_labels = data_df.iloc[test_ids]
shuffle = np.random.permutation(len(df))
test_size = int(len(df) * 0.2)
test_aux = shuffle[:test_size]
train_aux = shuffle[test_size:]
TRAIN_DF =df.iloc[train_aux]
TEST_DF = df.iloc[test_aux]
在我的例子中,我想用特定的数字分割训练、测试和开发中的数据帧。我在这里分享我的解决方案
首先,为数据帧分配一个唯一的id(如果已经不存在的话)
import uuid
df['id'] = [uuid.uuid4() for i in range(len(df))]
以下是我的分割数字:
train = 120765
test = 4134
dev = 2816
分裂函数
def df_split(df, n):
first = df.sample(n)
second = df[~df.id.isin(list(first['id']))]
first.reset_index(drop=True, inplace = True)
second.reset_index(drop=True, inplace = True)
return first, second
现在分成培训,测试,开发
train, test = df_split(df, 120765)
test, dev = df_split(test, 4134)
您需要将pandas数据帧转换为numpy数组,然后将numpy数组转换回数据帧
import pandas as pd
df=pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
train1=pd.DataFrame(train)
test1=pd.DataFrame(test)
train1.to_csv('/content/drive/My Drive/train.csv',sep="\t",header=None, encoding='utf-8', index = False)
test1.to_csv('/content/drive/My Drive/test.csv',sep="\t",header=None, encoding='utf-8', index = False)
如果你希望有一个数据帧和两个数据帧(不是numpy数组),这应该可以做到:
def split_data(df, train_perc = 0.8):
df['train'] = np.random.rand(len(df)) < train_perc
train = df[df.train == 1]
test = df[df.train == 0]
split_data ={'train': train, 'test': test}
return split_data