I’m finalizing my submission for my first real ML competition,
and had a question - apologies if this is too elementary.
The competition’s task is regression on images (specifically
2D X-rays). The training set is pretty small (about 700
images with relatively few positive examples),
and there’s no separate validation set given,
so to tune the hyperparameters I’ve been using 5-fold
cross validation on the training set data.
My thinking has been that after finding the best
hyperparameter settings I would then train the final
models for the submission on all the training data
(since the training set is so limited).
Is this a good idea, or should I stick with a model
trained with a validation set split off, so I can
see how the model performed on held-out
validation data?
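Concretely, my procedure looks roughly like this (a sketch with NumPy; the ridge model, the placeholder features, and the candidate alphas are just stand-ins for my actual image model and hyperparameters):

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: w = (X^T X + alpha*I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(700, 16))           # placeholder for image features
y = X @ rng.normal(size=16) + 0.1 * rng.normal(size=700)

n_folds = 5
fold_ids = np.arange(len(X)) % n_folds   # simple deterministic 5-fold split

best_alpha, best_mse = None, np.inf
for alpha in (0.1, 1.0, 10.0):           # candidate hyperparameter values
    errs = []
    for k in range(n_folds):
        tr, va = fold_ids != k, fold_ids == k
        w = ridge_fit(X[tr], y[tr], alpha)
        errs.append(np.mean((X[va] @ w - y[va]) ** 2))
    if np.mean(errs) < best_mse:
        best_alpha, best_mse = alpha, float(np.mean(errs))

# After tuning: refit on ALL training data with the chosen hyperparameter.
final_w = ridge_fit(X, y, best_alpha)
```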
It sounds like your procedure is perfectly valid for this problem. In both cases - cross validation or a fixed validation set - I would retrain on the entire dataset after hyperparameter tuning when the dataset is small. How many classes do you have, and how are they distributed in the training set?
We’re predicting about 40 target features on each image.
Each target has between 5 and 11 classes, but I’m using
regression because they are scores intended to measure
degree of damage (so between 0 and 4 or between 0 and 10).
Does that make sense?
As far as distribution, it’s heavily weighted to 0 -
for each target, between 80% and 95% of values are 0.
About @stefan-ai's recommendation to retrain the net on the whole dataset: that's fine if you're solving a real-life problem. Since this is a competition, it's better to train one model on each fold and, later, average their predictions. Finally, remember to use TTA (test-time augmentation)! Both tips will help increase your score.
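A rough sketch of this fold-averaging plus TTA idea (the linear `predict` is a hypothetical stand-in for a trained fold model's forward pass, and the TTA here is just a horizontal flip):

```python
import numpy as np

def predict(model_w, images):
    """Stand-in for a trained fold model: a linear map over flattened
    pixels. A real entry would run the trained network here."""
    return images.reshape(len(images), -1) @ model_w

def tta_predict(model_w, images):
    """Average predictions over the original and horizontally flipped image."""
    flipped = images[:, :, ::-1]             # flip along the width axis
    return (predict(model_w, images) + predict(model_w, flipped)) / 2

rng = np.random.default_rng(0)
test_images = rng.normal(size=(10, 8, 8))    # placeholder X-ray batch
fold_weights = [rng.normal(size=64) for _ in range(5)]  # one "model" per fold

# Ensemble: average the TTA predictions from the five fold models.
preds = np.mean([tta_predict(w, test_images) for w in fold_weights], axis=0)
```

With a real network, each `fold_weights` entry would be a checkpoint trained on four of the five folds, and you could add further augmentations (crops, small rotations) to the TTA average.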
Thanks, Victor! Very helpful …