Any tips on hyperparameter tuning?

I’ve been studying Fastai for a while, and am now using
it to participate in an image-related ML competition.
I’ve started to get some decent-looking results, but
am getting a bit stuck on how to tune the hyperparameter
choices. Specifically, there is quite a bit of random
noise in my results - the final metric’s value on the
validation set (MSE on a collection of predicted labels)
can vary by as much as 10% between training runs with
the same set of hyperparameters. I’m therefore having
trouble distinguishing when a hyperparameter change
is actually improving results.

Any tips on dealing with this situation? Do you try
to find choices that give dramatic (>10%) improvement?
Or to make the training more repeatable? Any other ideas?

That’s odd/interesting. How many training runs did you try?

The most important hyperparam is the learning rate. I’d suggest running the Learning Rate Finder – learn.lr_find() – right after you create the Learner.

Once you get that right and start seeing consistent results, you could start playing around with wd, different optimisers and loss functions to see what works best.
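Something like this, roughly (the path and labelling setup are placeholders, adapt them to your own pipeline):

```python
from fastai.vision.all import *

# Rough sketch -- path and labelling are placeholders
path = Path('data/images')
dls = ImageDataLoaders.from_folder(path, valid_pct=0.2, bs=64, item_tfms=Resize(224))

# vision_learner is cnn_learner in older fastai versions
learn = vision_learner(dls, resnet34, metrics=error_rate)

# Run the LR Finder right after creating the Learner; pick a value from the plot,
# roughly where the loss is dropping most steeply
learn.lr_find()

# Then train with the learning rate you chose
learn.fine_tune(5, base_lr=1e-3)
```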

1 Like

Is this a small dataset? This can definitely be more dramatic on smaller datasets. In that case you may have to look at alternative validation strategies such as k-fold.
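If you do go the k-fold route, a rough sketch with fastai could look like this (it assumes a DataFrame df with hypothetical image and target columns, so treat it as a starting point rather than working code):

```python
import numpy as np
from sklearn.model_selection import KFold
from fastai.vision.all import *

# Assumed: a DataFrame `df` with an 'image' column (file paths) and a 'target' column (floats)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for fold, (train_idx, valid_idx) in enumerate(kf.split(df)):
    dblock = DataBlock(
        blocks=(ImageBlock, RegressionBlock),
        get_x=ColReader('image'),
        get_y=ColReader('target'),
        splitter=IndexSplitter(valid_idx),   # this fold's rows become the validation set
        item_tfms=Resize(224))
    dls = dblock.dataloaders(df, bs=32)
    learn = vision_learner(dls, resnet34, metrics=mse)
    learn.fine_tune(5)
    scores.append(learn.validate()[1])       # the metric (MSE) on this fold's validation set

print(f'MSE per fold: {scores}')
print(f'mean {np.mean(scores):.4f} +/- {np.std(scores):.4f}')
```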

Off the top of my head, increasing momentum, square momentum, batch size, or epsilon might make things less variable. Batch size has the fewest negative trade-offs from increasing it, so if you can, I would increase that first (see the sketch after this list for where these live in fastai).
Momentum - more weight is put on the moving average of past gradients
Square momentum - same idea, but for the squared gradients, which act as the divisor
Batch size - effectively averages out the gradients; if you are using a small batch size (<32), increasing it might make a huge difference
Epsilon - offsets the squared-gradient average being too small (it's added to the divisor), making training a bit more stable
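Here is where those knobs live in fastai (the values are purely illustrative, not recommendations):

```python
from functools import partial
from fastai.vision.all import *

path = Path('data/images')   # placeholder
dls = ImageDataLoaders.from_folder(path, valid_pct=0.2, bs=64,   # bigger bs -> gradients average out more
                                   item_tfms=Resize(224))

# Adam's moving-average knobs; values here are illustrative only
opt = partial(Adam,
              mom=0.95,       # momentum on the gradient moving average
              sqr_mom=0.999,  # momentum on the squared-gradient average (the divisor)
              eps=1e-4)       # added to the divisor so it can't get too small

learn = vision_learner(dls, resnet34, opt_func=opt)
# Note: fit_one_cycle schedules momentum itself via its `moms` argument,
# so you can also tweak it there.
learn.fit_one_cycle(5, lr_max=1e-3, moms=(0.95, 0.85, 0.95))
```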

I have no idea what effect weight decay or dropout would have on the variability of the loss.

I would suggest setting up the Weights and Biases callback. It visualizes a lot of these variables and can help you build an intuitive understanding of them.
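Getting it going is roughly this (the project name and data setup are placeholders):

```python
import wandb
from fastai.callback.wandb import WandbCallback
from fastai.vision.all import *

wandb.init(project='my-image-regression')     # placeholder project name

path = Path('data/images')                    # placeholder
dls = ImageDataLoaders.from_folder(path, valid_pct=0.2, bs=64, item_tfms=Resize(224))

# The callback logs losses, metrics and hyperparameters to your W&B dashboard
learn = vision_learner(dls, resnet34, cbs=WandbCallback())
learn.fine_tune(5)
```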

2 Likes

Take a look at the sweep functionality of weights and biases, it may help you with the tuning!
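A sweep is basically a config plus a training function that reads from wandb.config; a minimal sketch (the parameter names, ranges and project name are illustrative):

```python
import wandb

# Illustrative sweep config -- parameter names and ranges are placeholders
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'valid_loss', 'goal': 'minimize'},
    'parameters': {
        'lr': {'min': 1e-4, 'max': 1e-2},
        'wd': {'values': [0.01, 0.1, 0.3]},
        'bs': {'values': [32, 64]},
    },
}

def train():
    wandb.init()
    cfg = wandb.config
    # ... build your DataLoaders/Learner here using cfg.lr, cfg.wd, cfg.bs,
    # train, and log valid_loss (the fastai WandbCallback handles the logging) ...

sweep_id = wandb.sweep(sweep_config, project='my-image-regression')  # placeholder project
wandb.agent(sweep_id, function=train, count=20)                      # run 20 trials
```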

2 Likes

While hyperparameter tuning can sometimes be a rigorous and costly process, I believe that with good practical experience a practitioner can start most problems with sensible hyperparameter values from the outset. Josh Tobin hinted at this in this presentation: josh-tobin.com/assets/pdf/troubleshooting-deep-neural-networks-01-19.pdf.

As Victor mentioned, Sweeps from Weights and Biases are superb for running hyperparameter searches efficiently. Here are some guides:

5 Likes

I’ve never used them, but I have a friend who swears by genetic algorithms. There are some Python packages that help you implement them.
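The rough idea, as I understand it, is a loop like this toy sketch (purely illustrative; packages like DEAP do it properly):

```python
import random

def evaluate(hp):
    # Stand-in for "train a model with these hyperparams and return the validation metric"
    return (hp['lr'] - 3e-3) ** 2 + (hp['wd'] - 0.1) ** 2

def random_hp():
    return {'lr': 10 ** random.uniform(-4, -2), 'wd': random.uniform(0.0, 0.3)}

def mutate(hp):
    # Small random perturbation of each hyperparameter
    return {k: v * random.uniform(0.8, 1.2) for k, v in hp.items()}

population = [random_hp() for _ in range(10)]
for generation in range(20):
    population.sort(key=evaluate)                              # lower metric is better
    survivors = population[:5]                                 # keep the fittest half
    population = survivors + [mutate(random.choice(survivors)) for _ in range(5)]

print('best hyperparams found:', min(population, key=evaluate))
```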

Many thanks, Molly!

It is a pretty small dataset, so that could be increasing the swings.
I’ve tried cross validation, but still get significant variation,
and of course it takes much longer.

I do have some room to increase batch size, so I will try that - I didn't think
of doing that, but it makes sense, as does momentum.

I already increased weight decay from the default values, and that visibly
improved performance even through the noise. This seems reasonable to me
(though I'm not sure my intuition is correct) - with a small dataset,
I'd think training would be prone to overfitting, which increasing weight
decay should help counteract.
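For reference, in fastai wd can be set on the Learner itself or passed per fit call; a minimal sketch (the data setup is a placeholder and the values are just examples):

```python
from fastai.vision.all import *

path = Path('data/images')   # placeholder for the actual data pipeline
dls = ImageDataLoaders.from_folder(path, valid_pct=0.2, bs=64, item_tfms=Resize(224))

# wd set for the whole Learner ...
learn = vision_learner(dls, resnet34, wd=0.1)
# ... or overridden per fit call
learn.fit_one_cycle(10, lr_max=3e-3, wd=0.1)
```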

I’ll try the Weights and Biases callback - is that this one:

Thanks, Sayak and Victor!
I’ll check those pages out, they look quite interesting.

Thanks, Rahul!

By odd, do you mean it seems suspicious/indicating a possible problem?
Do you usually not see this much variability? It is a pretty small dataset.

I have been using lr_find to pick learning rates.

Thanks, Daniel.

I’ve never used genetic algorithms, but people seem to be divided between swearing by them
and swearing at them …

1 Like

By odd, do you mean it seems suspicious/indicating a possible problem?

Yeah, it’s not common in my experience. But how small is your dataset?

To be fair though, I’ve only worked with image datasets; the smallest I’ve worked with was 100 images across 3 classes, where the differences between classes were subtle. Performance wasn’t great (~65% accuracy), but the results were consistent across multiple runs.

1 Like

You’re welcome! :smiley:
Yeah, my friend uses them for algo trading.

Good luck on the competition!!!

This is an image regression task, so I’m thinking it might have more variation in
the final metric since we’re taking the output directly for the metric, rather than
the argmax as we do for classification accuracy. Does that make sense?

The dataset is several hundred images, but relatively few positive examples
(with targets > 0).

Ah, I see. In that case, the variation does make sense.