
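I mostly use 80% / 20% of my data for training and validation, respectively, but I chose this split without any principled reason. Can someone more experienced in machine learning advise me?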





1. Split your data into training and testing sets (80/20 is indeed a good starting point).
2. Split the training data into training and validation sets (again, 80/20 is a fair split).
3. Subsample random selections of your training data, train the classifier on each, and record the performance on the validation set.
4. Try a series of runs with different amounts of training data: randomly sample 20% of it, say, 10 times and observe performance on the validation data, then do the same with 40%, 60%, and 80%. You should see both greater performance with more data and lower variance across the different random samples.
5. To get a handle on variance due to the size of the test data, perform the same procedure in reverse. Train on all of your training data, then randomly sample a percentage of your validation data a number of times, and observe performance. You should find that the mean performance on small samples of your validation data is roughly the same as the performance on all the validation data, but the variance is much higher with smaller numbers of test samples.
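The procedure above could be sketched roughly as follows. This is only an illustration under stated assumptions: the synthetic 1-D data, the toy threshold classifier, and all function names (`make_data`, `train`, `accuracy`, `split`, `training_size_curve`, `validation_size_curve`) are hypothetical stand-ins for your own dataset and model.

```python
import random
import statistics

def make_data(n, rng):
    """Synthetic 1-D two-class data (hypothetical stand-in for a real dataset)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * label, 1.0), label))  # class means at 0 and 2
    return data

def train(samples):
    """Toy classifier: a threshold halfway between the two class means."""
    xs0 = [x for x, y in samples if y == 0]
    xs1 = [x for x, y in samples if y == 1]
    if not xs0 or not xs1:  # degenerate subsample: fall back to the overall mean
        return statistics.mean(x for x, _ in samples)
    return (statistics.mean(xs0) + statistics.mean(xs1)) / 2.0

def accuracy(threshold, samples):
    """Fraction of samples whose label matches 'x > threshold'."""
    return sum((x > threshold) == bool(y) for x, y in samples) / len(samples)

def split(data, frac, rng):
    """Shuffle a copy of data and cut it into (first frac, remainder)."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * frac)
    return shuffled[:cut], shuffled[cut:]

FRACTIONS = (0.2, 0.4, 0.6, 0.8)

def training_size_curve(train_set, val_set, rng, repeats=10):
    """Vary the amount of training data; score each run on the full validation set."""
    results = {}
    for frac in FRACTIONS:
        k = round(len(train_set) * frac)
        scores = [accuracy(train(rng.sample(train_set, k)), val_set)
                  for _ in range(repeats)]
        results[frac] = (statistics.mean(scores), statistics.stdev(scores))
    return results

def validation_size_curve(train_set, val_set, rng, repeats=10):
    """Train once on all training data; vary the amount of validation data."""
    threshold = train(train_set)
    results = {}
    for frac in FRACTIONS:
        k = round(len(val_set) * frac)
        scores = [accuracy(threshold, rng.sample(val_set, k))
                  for _ in range(repeats)]
        results[frac] = (statistics.mean(scores), statistics.stdev(scores))
    return results

rng = random.Random(0)
data = make_data(1000, rng)
train_full, test_set = split(data, 0.8, rng)      # 80/20 train/test
train_set, val_set = split(train_full, 0.8, rng)  # 80/20 train/validation

train_curve = training_size_curve(train_set, val_set, rng)
val_curve = validation_size_curve(train_set, val_set, rng)
for name, curve in (("training size", train_curve), ("validation size", val_curve)):
    for frac, (mean, std) in curve.items():
        print(f"{name} {frac:.0%}: mean accuracy {mean:.3f}, std {std:.3f}")
```

Comparing the standard deviations across fractions in the two curves gives a rough picture of how much of the variance comes from the training-set size and how much from the validation-set size. The held-out `test_set` is deliberately left untouched here.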








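If you have a very large dataset, say 1,000,000 examples, an 80/10/10 split may be unnecessary, because 10% = 100,000 examples may be far more than you need just to confirm that the model works well.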
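One rough way to judge how many test examples are enough is the binomial standard error of a measured accuracy. The sketch below assumes independent test examples, a normal approximation, and a hypothetical true accuracy of 0.90; the function name `accuracy_standard_error` is illustrative only.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# Hypothetical true accuracy of 0.90; 1.96 standard errors is roughly a 95% interval.
for n_test in (1_000, 10_000, 100_000):
    half_width = 1.96 * accuracy_standard_error(0.90, n_test)
    print(f"n_test={n_test:>7}: accuracy known to about +/- {half_width:.4f}")
```

Because the interval shrinks only with the square root of the test-set size, going from 10,000 to 100,000 test examples buys relatively little extra precision.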




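Perhaps 63.2% / 36.8% is a reasonable choice. The reasoning is that if you have a total sample size of n and randomly sample n cases from the initial n with replacement (also known as resampling, as in the statistical bootstrap), the probability of an individual case being selected in the resample is approximately 0.632, provided that n is not too small, as explained here: https://stats.stackexchange.com/a/88993/16263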

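For a sample of n=250, the probability of an individual case being selected in a resample is 0.6329 to 4 digits. For a sample of n=20000, the probability is 0.6321.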
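These figures can be checked directly. A single case is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, and it appears in the resample with probability 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632 as n grows. A minimal sketch (the function name `prob_in_bootstrap` is illustrative only):

```python
import math

def prob_in_bootstrap(n):
    """Probability that a given case appears at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n

for n in (250, 20000):
    print(f"n={n}: {prob_in_bootstrap(n):.4f}")  # 0.6329 for n=250, 0.6321 for n=20000

print(f"limit as n grows: {1.0 - 1.0 / math.e:.4f}")  # 0.6321
```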