Do You Re-train on the Whole Dataset After Validating the Model?
Suppose we have a dataset split into 80% for training and 20% for validation. Do you do A) or B)?
Method A)
- Train on 80%
- Validate on 20%
- Model is good, train on 100%.
- Predict test set.
Method B)
- Train on 80%
- Validate on 20%
- Model is good, use this model as is.
- Predict test set.
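The two workflows can be sketched in a few lines of scikit-learn. This is a minimal sketch, not anyone's actual pipeline: the dataset, estimator, and split are illustrative assumptions.

```python
# Sketch of Methods A and B. The dataset and model are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 80% train / 20% validation

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_score = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy: {val_score:.3f}")

# Suppose the validation score says the model is good:
model_a = LogisticRegression(max_iter=1000).fit(X, y)  # Method A: re-train on 100%
model_b = model                                        # Method B: keep the model as is
```

Either `model_a` or `model_b` would then be used to predict the test set; the question is which one.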
I posted this question on Kaggle, and I'll summarize the answers here.
For myself, I do A), for the following aggregated reasons:
- More data is better. In the case of time series, including the most recent data is especially valuable.
- Cross-validation is used to validate the hyper-parameters used to train a model, rather than the model itself. You then pick the best hyper-parameters and re-train the model.
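The second point is exactly what scikit-learn's `GridSearchCV` does with `refit=True` (the default): cross-validation is used only to pick the hyper-parameters, and the winning configuration is then re-trained on all of the data passed to `fit`. The dataset and parameter grid below are illustrative assumptions.

```python
# Cross-validation picks hyper-parameters; the best config is then
# re-fit on 100% of the data (refit=True is the default).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,        # 5-fold CV scores each candidate C
    refit=True,  # then re-trains the best C on all of X, y
)
search.fit(X, y)
print(search.best_params_)
final_model = search.best_estimator_  # trained on 100% of X, y
```

`final_model` is the Method A outcome: validated via held-out folds, then re-trained on everything before predicting the test set.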