K-fold validation

My final-year research project was on identifying minerals with CNNs. One recommendation I got from a lecturer was to use K-fold validation, probably because the dataset is small (<1000 images).
I want to know how K-fold validation can be done in fastai.

@init_27 @sgugger @lesscomfortable

You can use sklearn's KFold (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) to split your data into a number of random folds, say 5.

Then you train a model for each fold. If you are using the data block API, you just have to call split_by_idx with the validation indices given by KFold to get the correct train/valid split for each fold (sketch below).

At the end you get 5 models, one trained on each fold.
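
Here is a minimal sketch of that loop, assuming the fastai v1 data block API and an image dataset organised in one folder per class; the path, architecture, image size, batch size, and number of epochs are placeholders to adapt to your data:

```python
import numpy as np
from sklearn.model_selection import KFold
from fastai.vision import *  # fastai v1

path = Path('data/minerals')      # hypothetical folder: one subfolder per class
il = ImageList.from_folder(path)  # collect all images

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_metrics = []

for fold, (train_idx, valid_idx) in enumerate(kf.split(np.arange(len(il.items)))):
    # split_by_idx takes the *validation* indices for this fold
    data = (il.split_by_idx(valid_idx)
              .label_from_folder()
              .transform(get_transforms(), size=224)
              .databunch(bs=16)
              .normalize(imagenet_stats))
    learn = cnn_learner(data, models.resnet34, metrics=accuracy)
    learn.fit_one_cycle(5)
    fold_metrics.append(learn.validate())  # [valid_loss, accuracy] for this fold
    learn.save(f'fold_{fold}')             # keep each fold's weights for the ensemble
```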


@mnpinto thanks, but if I get you right, I would have to train 5 different models, right?

Yes, five models. Then you average the results like any other ensemble method. Also, this gives you an idea of how variable your validation metric or loss can be. If the losses across folds are very different, then your model is less likely to work well out of sample.
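
For the variability check, something like this works on the fold_metrics list collected in the training loop sketched above (the names come from that sketch):

```python
import numpy as np

# fold_metrics was filled in the training loop: one [valid_loss, accuracy] per fold
accs = [float(m[1]) for m in fold_metrics]
print(f'accuracy across folds: {np.mean(accs):.3f} +/- {np.std(accs):.3f}')
```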


People on Kaggle use K-fold all the time to train their LGBM (or similar) models. You might find a good base for your loop there.
Also look into sklearn’s StratifiedKFold if you have imbalanced classes, so each fold keeps the class proportions (sketch below).
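
A sketch of the stratified variant, reusing the il ImageList from the earlier example; reading the labels from the folder names is an assumption about your data layout:

```python
from sklearn.model_selection import StratifiedKFold

# Class label of each image, aligned with il.items (folder name = class)
labels = [p.parent.name for p in il.items]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(il.items, labels)):
    # same training loop as before; each fold now keeps the class proportions
    ...
```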
