Do You Re-train on the Whole Dataset After Validating the Model?
Suppose we have a dataset split into 80% for training and 20% for validation. Do you do A) or B)?
Method A)
- Train on 80%
- Validate on 20%
- Model is good, train on 100%.
- Predict test set.
Method B)
- Train on 80%
- Validate on 20%
- Model is good, use this model as is.
- Predict test set.
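The two workflows can be sketched in a few lines of scikit-learn. This is a minimal sketch, not anyone's actual pipeline: the dataset, estimator, and split are illustrative assumptions.

```python
# Sketch of Methods A and B. The dataset and model are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 80% train / 20% validation

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_score = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy: {val_score:.3f}")

# Suppose the validation score says the model is good:
model_a = LogisticRegression(max_iter=1000).fit(X, y)  # Method A: re-train on 100%
model_b = model                                        # Method B: keep the model as is
```

Either `model_a` or `model_b` would then be used to predict the test set; the question is which one.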
I posted this question on Kaggle, and I'll summarize the answers here.
For myself, I do A), for the following aggregated reasons:
- More data is better. In the case of time series, including the most recent data is especially valuable.
- Cross-validation is used to validate the hyper-parameters used to train a model, rather than the model itself. You then pick the best hyper-parameters and re-train the model.
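The second point is exactly what scikit-learn's `GridSearchCV` does with `refit=True` (the default): cross-validation is used only to pick the hyper-parameters, and the winning configuration is then re-trained on all of the data passed to `fit`. The dataset and parameter grid below are illustrative assumptions.

```python
# Cross-validation picks hyper-parameters; the best config is then
# re-fit on 100% of the data (refit=True is the default).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,        # 5-fold CV scores each candidate C
    refit=True,  # then re-trains the best C on all of X, y
)
search.fit(X, y)
print(search.best_params_)
final_model = search.best_estimator_  # trained on 100% of X, y
```

`final_model` is the Method A outcome: validated via held-out folds, then re-trained on everything before predicting the test set.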