I’ve been studying Fastai for a while, and am now using
it to participate in an image-related ML competition.
I’ve started to get some decent-looking results, but
am getting a bit stuck on how to tune the hyperparameter
choices. Specifically, there is quite a bit of random
noise in my results - the final metric’s value on the
validation set (MSE on a collection of predicted labels)
can vary by as much as 10% between training runs with
the same set of hyperparameters. I’m therefore having
trouble distinguishing when a hyperparameter change
is actually improving results.
Any tips on dealing with this situation? Do you try
to find choices that give dramatic (>10%) improvement?
Or to make the training more repeatable? Any other ideas?
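On the repeatability point: one lever is seeding every source of randomness before each run. fastai ships a `set_seed` helper (in `fastai.torch_core`) that seeds Python, NumPy, and PyTorch in one call, though full determinism on GPU also depends on cuDNN settings. A minimal sketch of the idea using NumPy alone (`seeded_run` is a hypothetical stand-in for one training run):

```python
import numpy as np

def seeded_run(seed):
    """Simulate one 'training run' whose result depends on random init and data order."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=5)   # stand-in for weight initialization
    order = rng.permutation(10)    # stand-in for data shuffling
    return weights.sum(), order.tolist()

a = seeded_run(42)
b = seeded_run(42)
c = seeded_run(7)
print(a == b)  # True: the run is fully determined by the seed
print(a == c)  # False: a different seed gives a different (but repeatable) run
```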
Is this a small dataset? This can definitely be more dramatic on smaller datasets. In that case you may have to look at alternative validation strategies such as k-fold.
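For k-fold you would typically generate the fold indices yourself and average the metric across folds; the mean over k folds is much less noisy than any single split. A framework-free sketch (`evaluate_fold` is a hypothetical stand-in for one training-and-validation run):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle the indices once, then split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def kfold_score(n, k, evaluate_fold):
    """Average a per-fold metric across all k train/validation splits."""
    folds = kfold_indices(n, k)
    scores = []
    for i, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(evaluate_fold(train_idx, valid_idx))
    return float(np.mean(scores)), float(np.std(scores))
```

With fastai, each `valid_idx` array could then be fed to something like `IndexSplitter` when building the `DataBlock`, so every fold trains on the remaining data.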
Off the top of my head, increasing momentum, square momentum, batch size, or epsilon might make things less variable. Batch size has the fewest negative trade-offs from increasing it, so if you can, I would increase that first.
Momentum - a higher value means more of the moving average of past gradients is used, which smooths the updates
Square momentum - the same idea applied to the squared gradients, which act as a divisor in the update
Batch size - effectively averages out the gradients. If you are using a small batch size (<32), increasing it might make a huge difference
Epsilon - can offset the squared-gradient average being too small, and make training a bit more stable
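To see why higher momentum damps the jitter: Adam-style optimizers keep exponential moving averages of the gradients (momentum) and squared gradients (square momentum), and a larger smoothing factor averages over more history. In fastai these correspond, if I remember the API right, to the `mom` and `sqr_mom` arguments of its `Adam`. A NumPy sketch of the smoothing effect itself:

```python
import numpy as np

def ema(xs, beta):
    """Exponential moving average: higher beta = more history = smoother output."""
    avg, out = 0.0, []
    for x in xs:
        avg = beta * avg + (1 - beta) * x
        out.append(avg)
    return np.array(out)

rng = np.random.default_rng(0)
noisy_grads = rng.normal(size=1000)   # stand-in for noisy per-batch gradients
low  = ema(noisy_grads, beta=0.90)    # fastai's default momentum
high = ema(noisy_grads, beta=0.99)
print(low.std() > high.std())         # True: more momentum, less variance
```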
I have no idea what effect weight decay or dropout would have on the variability of the loss.
I would suggest setting up the Weights and Biases callback. It helps visualize a lot of these variables and can help you get an intuitive understanding of these things.
It is a pretty small dataset, so that could be increasing the swings.
I’ve tried cross validation, but still get significant variation,
and of course it takes much longer.
I do have some room to increase the batch size, so I will try that - I didn't think
of doing that, but it makes sense, as does momentum.
I already increased weight decay from the default values, and that visibly
improved performance even through the noise. This seems reasonable to me
(though I’m not sure if my intuition is correct) - with a small dataset,
I’d think the training would be prone to overfitting, which increasing weight
decay should help.
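That intuition lines up with what weight decay does mechanically: with decoupled (AdamW-style) decay, each step shrinks the weights toward zero by roughly a factor of (1 - lr * wd), which limits how far a small, noisy dataset can pull them. A toy gradient-descent sketch of that shrinking effect:

```python
import numpy as np

def train_weights(wd, steps=200, lr=0.1):
    """Toy gradient descent with decoupled weight decay on noisy gradients."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=10)
    for _ in range(steps):
        grad = rng.normal(scale=0.1, size=10)  # stand-in for a noisy data gradient
        w -= lr * grad       # usual gradient step
        w *= 1 - lr * wd     # decay step: shrink weights toward zero
    return w

no_decay   = train_weights(wd=0.0)
with_decay = train_weights(wd=0.1)
print(np.linalg.norm(with_decay) < np.linalg.norm(no_decay))  # True
```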
I’ll try the Weights and Biases callback - is that this one:
By odd, do you mean it seems suspicious/indicating a possible problem?
Yeah, it’s not common in my experience. But how small is your dataset?
To be fair though, I’ve only worked with image datasets; the smallest I’ve worked with was 100 images across 3 classes, where the differences between classes were subtle. Performance wasn’t great (~65% accuracy), but the results were consistent across multiple runs.
This is an image regression task, so I’m thinking it might have more variation in
the final metric since we’re taking the output directly for the metric, rather than
the argmax as we do for classification accuracy. Does that make sense?
The dataset is several hundred images, but relatively few positive examples
(with targets > 0).
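With that few positive examples, one practical way to tell a real hyperparameter improvement from noise is to repeat each configuration several times with different seeds and compare the means against the run-to-run spread. A sketch of that bookkeeping (the MSE values below are hypothetical):

```python
import math

def summarize_runs(scores):
    """Mean and sample standard deviation of a metric across repeated runs."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1) if n > 1 else 0.0
    return mean, math.sqrt(var)

def clearly_better(a_scores, b_scores):
    """Crude check: do the mean +/- std intervals of two configs overlap?
    Lower is better for a metric like MSE."""
    a_mean, a_std = summarize_runs(a_scores)
    b_mean, b_std = summarize_runs(b_scores)
    return a_mean + a_std < b_mean - b_std

# hypothetical validation-MSE values from 4 runs per configuration
baseline = [0.110, 0.102, 0.098, 0.106]
tweaked  = [0.089, 0.085, 0.092, 0.086]
print(clearly_better(tweaked, baseline))  # True: the gap exceeds the noise
```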