How we beat the 5 epoch ImageWoof leaderboard score - some new techniques to consider

Hi all,
I’m very happy to announce that after a lot of work, we’ve finally been able to pretty soundly improve the leaderboard score for 5 epoch ImageWoof.
Updates -
1 - More than just 5 epochs: we’ve now used the same techniques on both ImageNette and ImageWoof, at 5 and 20 epochs, with all px size variations, and beaten the previous records in every case (12 total cases). Continued out-performance at 20 epochs was encouraging.
2 - Thanks to @muellerzr, if you update to the latest version of fastai you can now use learn.fit_fc() to run the flat + cosine anneal we used in place of fit_one_cycle, and try it out for yourself.

It took a lot of new techniques combined, and rather than have you read the long Mish activation function thread where most of this happened, I thought it might be better to summarize here. This way, you can quickly get an overview of the new techniques we had success with, see if they might be of use in your own work, and be aware of what’s out there now.

(And to clarify when I say we - @Seb / @muellerzr / @grankin / @Redknight / @oguiza / @fgfm all made vital contributions to this effort).

First, here was the original ImageWoof 5 epoch, 128 score since March:

and here’s our latest result as of today:

As you can see, it’s quite a nice jump. There are multiple changes involved to get there:

1 - Optimizer - we changed to use Ranger instead of Adam. Ranger came from combining RAdam + LookAhead.
The summary of each - RAdam (Rectified Adam) was the result of Microsoft research looking into why adaptive optimizers need a warmup or else they usually shoot into bad optima. The answer is that the optimizer makes premature jumps while the variance of the adaptive learning rate is still too high. In short, it needs to see more data before it starts making larger decisions, and RAdam achieves this automatically by adding a rectifier that dynamically tamps down the adaptive learning rate until the variance settles. Thus, no warmup needed!
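To make the rectifier a bit more concrete, here is a minimal sketch of the rectification factor as I understand it from the RAdam paper. The function name and the None return are my own choices for illustration, and the threshold of 4 is what I recall from the paper (some implementations use a slightly different cutoff), so treat this as a sketch under those assumptions rather than the reference implementation.

```python
import math

def radam_rectifier(t, beta2=0.999):
    """Rectification factor r_t for step t >= 1 (hypothetical helper name).

    Returns None while the variance estimate is considered unreliable,
    which is when RAdam skips the adaptive scaling for that step.
    """
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** t
    rho_t = rho_inf - 2.0 * t * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:  # assumed threshold, as recalled from the paper
        return None
    num = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    den = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(num / den)

if __name__ == "__main__":
    for step in (1, 10, 100, 1000, 100_000):
        print(step, radam_rectifier(step))
```

With the default beta2 the factor is absent for the very first steps, small early on, and approaches 1 as training proceeds, which is the "automatic warmup" behaviour described above.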
LookAhead - LookAhead was developed in part by Geoffrey Hinton (you may have heard of him?). It keeps a separate set of weights from the optimizer, and then every k steps (5 or 6) interpolates between its weights and the optimizer’s, in effect splitting the difference. The result is like having a buddy system to explore the loss terrain: the faster optimizer scouts ahead, while LookAhead makes sure it can be pulled back if it lands in a bad optimum. Thus, safer / better exploration and faster convergence. (I have more detailed articles on both, which I’ll link below.)
Note that RangerLars (Ranger + Layer-wise Adaptive Rate Scaling), also called Over9000, is actively being worked on and looks really good as well (@Redknight developed it, and along with @grankin / @oguiza / @fgfm / @muellerzr is rapidly driving it forward). It’s very close to Ranger and may surpass it in the future.
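The LookAhead idea described above can be sketched in a few lines. This is only an illustration: plain SGD stands in for the inner (fast) optimizer instead of RAdam, weights are plain Python lists, and the function name and defaults (k=5, alpha=0.5) are my assumptions, not the Ranger code.

```python
def lookahead_sgd(grad_fn, w0, lr=0.1, k=5, alpha=0.5, steps=100):
    """Toy LookAhead wrapper around plain SGD (hypothetical helper).

    grad_fn: function mapping a list of weights to a list of gradients.
    Every k fast steps, the slow weights move a fraction alpha toward
    the fast weights, and the fast weights restart from the slow ones.
    """
    slow = list(w0)
    fast = list(w0)
    for step in range(1, steps + 1):
        grads = grad_fn(fast)
        fast = [w - lr * g for w, g in zip(fast, grads)]  # fast optimizer explores
        if step % k == 0:
            # interpolate: alpha=0.5 literally splits the difference
            slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
            fast = list(slow)  # pull the fast weights back to the slow ones
    return slow

if __name__ == "__main__":
    # minimise (w - 3)^2 for each weight; the gradient is 2 * (w - 3)
    quadratic_grad = lambda w: [2.0 * (wi - 3.0) for wi in w]
    print(lookahead_sgd(quadratic_grad, [0.0, 10.0]))
```

On this toy quadratic both weights end up very close to 3, which is all the example is meant to show.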

2 - Flat + Cosine annealing training curve instead of OneCycle - this is the invention of @grankin, and matched what I was suspecting after testing out a lot of new optimizers. Namely, OneCycle appears to do well with vanilla Adam, but all that up and down tends to mess up the newer optimizers.
Thus, instead of cycling the learning rate up and down, we simply use a flat start (e.g. 4e-3), and then somewhere between 50% and 75% of the way through training, start dropping the learning rate along a cosine anneal.
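As a rough sketch of that schedule (not the fastai fit_fc code itself): the function below holds the learning rate flat and then cosine-anneals it. The name flat_cos_lr, the start_pct default of 0.75 and the final_lr of 0 are assumptions picked for illustration.

```python
import math

def flat_cos_lr(step, total_steps, base_lr=4e-3, start_pct=0.75, final_lr=0.0):
    """Learning rate at a given step: flat, then cosine anneal (hypothetical helper)."""
    anneal_start = int(total_steps * start_pct)
    if step < anneal_start:
        return base_lr  # flat phase
    # progress through the annealing phase, from 0.0 to 1.0
    span = max(1, total_steps - anneal_start - 1)
    progress = min((step - anneal_start) / span, 1.0)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    total = 100
    for s in (0, 50, 74, 75, 87, 99):
        print(s, flat_cos_lr(s, total))
```

Setting start_pct to 0.5 gives the earlier anneal start mentioned above; where exactly to start the drop is one of the knobs worth testing.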

That made a big performance jump across the optimizers I had tested (Novograd, stochastic Adam, etc.), with the exception of Adam.

3 - Mish activation function instead of ReLU - Mish is a new activation function that was released in a paper about a week ago. It has a much smoother curve than ReLU, and in theory, that drives information more deeply through the network. It is slower than ReLU, but on average adds about 1-3% improvement vs ReLU (details in the article link below). There’s an MXResNet now to make it easy to use XResNet + Mish (link below).
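For reference, Mish is defined as x * tanh(softplus(x)). Here is a minimal scalar sketch in plain Python so you can compare it with ReLU by eye; the helper names are mine, and a real model would use a tensor version instead.

```python
import math

def softplus(x):
    # numerically stable log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    # Mish: x * tanh(softplus(x))
    return x * math.tanh(softplus(x))

def relu(x):
    return max(x, 0.0)

if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  mish={mish(x):.4f}")
```

Unlike ReLU, Mish lets small negative values through (it dips slightly below zero before flattening out), and it approaches the identity for large positive inputs.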

4 - Self attention layer - this is @Seb’s brainchild, along with input from @grankin, so I’m stretching to describe it but it’s a small layer added to MXResNet / XResNet. The self attention layer is designed to help leverage long range dependencies within an image vs the more local feature focus of convolutions. (Original paper link below, but maybe @Seb will do a post on it in more detail).

5 - Resize Image quality - Finally, we found that resizing images to 128 directly from the original ImageWoof images (instead of resizing from 160->128) produces higher-quality images, which literally adds ~2% accuracy. @fgfm confirmed the reason behind that, but it was a bit of a surprise. Thus, you’ll want to pay attention to how you are getting to your image size, as it does have a non-trivial impact.

ImageNette/Woof serve a great gatekeeper function: One thing I should add is that the value of having these test datasets like ImageNette and especially ImageWoof (due to being harder) is quite great. I tested a lot of activation functions this year before Mish, and in most cases while things looked awesome in the paper, they would fall down as soon as I put them to use on more realistic toy datasets like ImageNette/Woof.
Many of the papers show results using MNIST or CIFAR-10, which in my experience offer minimal proof of how the ideas will truly fare.

Thus, a big thanks to @jeremy for making these, as they really do serve as an important gatekeeper for quickly testing what has real promise and what does not!

That’s the quick overview - now for some links if you want to dig deeper:

If you want to test out MXResNet and all the changes above -
There’s a full github repo with a training workbook so you can readily test out these features and ideally improve on them. You can run with and without the self attention layer as well via the --sa 1 parameter:

Here’s some more reading info to learn more about Ranger / Mish and Self Attention:
Mish activation:

and github for it:

Ranger optimizer:

Self-Attention: (and see @Seb for more :slight_smile:)

Most of the developments ended up happening in this Mish thread:

Finally @muellerzr put together a nice list of all the relevant papers (including some like Novograd that were tested but didn’t ultimately get used):

Anyway, exciting times for deep learning as a lot of new ideas make their way into testing and possibly long term success.



Thanks Less for doing the writeup!

One more thing that helped a bit was fixing an oversight in the training code that divided the learning rate by 4.

Edit to add: we also switched channels from sizes = [c_in,32,32,64] to sizes = [c_in,32,64,64]

If I had to pick one thing to try first, it would be RAdam + LookAhead with the modified learning rate schedule.

SimpleSelfAttention is still research in progress. I have a github here:
The readme seems a bit unreadable, but the code should be easy to run if you’ve run xresnet. I can answer questions if you have any.


Thanks Less for the write-up and thanks to everyone involved for the experience and work we got in! It was certainly something I had never experienced before and I thoroughly enjoyed the strong competitive environment we had going on :slight_smile: Here is a notebook (on the repo Less mentioned above) where I laid out 7 of the attempts we did: notebook

Specifically, they are only 1 run of five epochs, due to Colab, but they should be set up to get close to what we were achieving. The list is as follows:

  • Baseline (Adam + xResnet50) + OneCycle
  • Ranger (RAdam + LookAhead) + OneCycle
  • Ranger + Flatten Anneal
  • Ranger + MXResnet (xResnet50 + Mish) + Flatten Anneal
  • RangerLars (Ralamb + LARS + Ranger) + MXResnet + Flatten Anneal
  • RangerLars + xResnet50 + Flatten Anneal
  • Ranger + SimpleSelfAttention + MXResnet + Flatten Anneal

Some further thoughts carried over from the other post:

Perhaps trying out MixUp?
Perhaps try CutOut - Less

In any case, please use those as base ideas and see how they improve! Those further ideas are untested :slight_smile:

And thanks again everyone involved. It was quite the fun and rewarding experience!


Great job! Thanks all!
I’m testing some combinations of activation / optimisation / scheduling too.
Mish is very good!
Experiments with optimisers.

Have you looked at how these tweaks impact results over a long training period?

Something I always struggle with when trying to improve models is disentangling effects that lead to faster training from effects that lead to a higher overall accuracy at convergence.

Model A reaches 75% accuracy after 5 epochs, and 95% accuracy after 100 epochs.
Model B reaches 55% accuracy after 5 epochs, and 99% accuracy after 100 epochs.

We can see the difference between faster convergence to a lower accuracy, vs slower convergence to a higher accuracy.

Or in the case of tweaking a model:
Base Model - 75% accuracy after 5 epochs, 94% accuracy after 50 epochs
Model + tweak - 78% accuracy after 5 epochs

The tweak caused the 5 epoch accuracy to increase by 3%. Will this translate to an accuracy increase at 50 epochs, or will the model + tweak converge to the same accuracy, just faster?

This stuff gets tricky, and I don’t have a good sense of how to probe it. The brute force way is to overtrain all your models, but this is usually a waste of time, especially on limited compute. I think this is also a big issue when evaluating strong regularization effects like Mixup. We expect strong regularization to make it more difficult for the model to train, likely requiring more epochs to reach similar or improved performance over a model without regularization.

I’m sure this also impacts published literature. How many published improvements from model tweaks are secretly just training longer?

I know @Seb did some 80 epoch tests, he could answer more to what he found

We definitely need to run longer tests. The 80-epoch tests have to be rerun because I didn’t use the best learning rate schedule for them (annealing started halfway through when it should probably start much later on).

Hi @KarlH,
Yes, you are completely correct that short term success does not necessarily mean long term success. SGD, for example, can often beat out a lot of things… if you are willing to let it run long enough. And of course, long term is also a vague concept (e.g. 80, 200, 400 epochs, etc).
In some cases, like Mish, I believe the theory is there that it should produce a more robust nn, short and long term.
Regardless, we have to consider these as ‘gateway’ tests that indicate these new tools and concepts have definite merit, but still will have to be proven over the longer term. I’d be really interested to apply these towards a Kaggle competition or commercial production AI where we’d have a financial incentive to prove them out for the long term.
That’s in part b/c of the cost of resources for that kind of long term testing vs 5 epochs… I already spent quite a bit just on GPU time for the 5 epoch work, and have no plans to spend more to do the same amount of testing at 80 epochs, for example. :slight_smile:
Thanks for the feedback!

For onecycle we can use the lr_find to pick a starting learning rate. How do you pick a constant learning rate when working with Ranger? And for the final value of lr after annealing, how do you decide that value?

You can still use lr_finder to at least get an idea of the landscape. I just used the lr finder, picked that value and then also tested + and - 10x to start getting an idea and kind of honed in from there.
In general, it seemed to like slightly lower lr.
Re: ending lr - we just let it descend to near zero, but it’s possible flattening out sooner could be beneficial. That’s an area to test out further for sure.
Lastly, AutoOpt is being worked on now and may (may) solve the whole lr and momentum aspect automatically and optimally. Let’s see where it is in the next week.


Got it. Thanks.

Just a note on how applicable this stuff is: on some tabular research I’m doing, I improved my results by ~3-4% in relative terms (we’re well above 90% here), with statistical significance!

I intend to run this on Rossmann today as well and will make a separate post for those results due to its relation with tabular

One note: I’ve found that 4e-3 is still a good enough LR over there too.

Interestingly enough, some datasets will see an increased accuracy, whereas others won’t. I was able to get roughly a little below Jeremy’s exp_rmspe score on one hand. But in a project with my research I was able to achieve that statistically significant difference. One difference is that research project was not regression nor binary, it was multi class. I’ll have to look more into this behavior and if it’s limited by my 2 datasets I used, or if I can repeat it on others.

Do note: by “won’t” I mean I achieved the same accuracy as w/o it.


How does this compare to using LAMB as an optimizer instead?

Also, if I understand correctly, Lookahead can be used with any optimizer. So has it been tried with LAMB? It seems you guys tried it with LARS, which I understand is somewhat similar, so I would expect good results with LAMB as well, no?

I don’t believe we’ve tried LAMB yet. I will run some tests and tonight I’ll update with how those went :slight_smile: (I’ll also post in the other forum post as well)

@ilovescience see this post in the “Meet Mish: New Activation function, possible successor to ReLU?” thread.

How do you update to the new fastai library to use fit_fc()? I’m using Colab, and the version of fastai I get when I use !curl -s | bash doesn’t seem to have learn.fit_fc()

Secondly, how do RAdam/Ranger/etc. compare to AdamW (or SGD with momentum) when trained for longer? I seem to recall a post on the forums that found these new optimizers actually performed worse when trained for 80 epochs.

You only need to run that for course-v3 on Colab (I know because Colab is all I use).

Follow the command used here:

Run !pip install git+ to grab the most recent version

And then restart your instance and you should be importing what you need :slight_smile:

I have a small improvement on 5 epochs.
I can’t reproduce the leaderboard results, so my baseline on Colab, with the same arguments, is:
0.7412, std 0.011771156
[0.746 0.75 0.748 0.744 0.718]
Same, but with act_fn ReLU:
0.75720006, std 0.010007978
[0.744 0.766 0.758 0.77 0.748]
And with LeakyReLU:
0.7576, std 0.0058514797
[0.758 0.756 0.748 0.766 0.76]
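If you want to double check summaries like these, a small sketch follows. It assumes the quoted std values are population standard deviations (which is what they appear to match); the dictionary labels and the summarize helper are just names I picked for illustration.

```python
from statistics import mean, pstdev

def summarize(accuracies):
    """Return (mean, population std) for a list of per-run accuracies."""
    return mean(accuracies), pstdev(accuracies)

# per-run accuracies copied from the three result lines above
runs = {
    "baseline": [0.746, 0.75, 0.748, 0.744, 0.718],
    "ReLU": [0.744, 0.766, 0.758, 0.77, 0.748],
    "LeakyReLU": [0.758, 0.756, 0.748, 0.766, 0.76],
}

if __name__ == "__main__":
    for name, accs in runs.items():
        m, s = summarize(accs)
        print(f"{name:10s} mean={m:.4f} std={s:.4f}")
```

Note that with only 5 runs each and standard deviations around 0.006 to 0.012, the gaps between these means are fairly small relative to the noise.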
Here are the results:
Most important here: when I tested different activations, I got strange results and began to check everything.
And I found a bug in the xresnet implementation (so in mxresnet too)!
In the init_cnn function, we init the model with nn.init.kaiming_normal_. But its default argument is nonlinearity='leaky_relu'.
So I changed it to nonlinearity='relu' and got a better result. Same for LeakyReLU.
There is no implementation in torch for Mish, so maybe there is room for a better result there!
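For anyone digging into this, here is a small sketch of the gain formula that, as far as I know, PyTorch documents for these two nonlinearities: sqrt(2) for relu and sqrt(2 / (1 + a^2)) for leaky_relu with negative slope a. The helper names below are hypothetical and this is plain Python, not the PyTorch code. If I read the kaiming_normal_ signature right, a defaults to 0, in which case the two gains coincide, so it is worth checking which negative slope is actually in effect before attributing a difference to the init.

```python
import math

def kaiming_gain(nonlinearity, a=0.0):
    """Gain used by Kaiming init (hypothetical helper mirroring the documented formula)."""
    if nonlinearity == "relu":
        return math.sqrt(2.0)
    if nonlinearity == "leaky_relu":
        return math.sqrt(2.0 / (1.0 + a ** 2))
    raise ValueError(f"unsupported nonlinearity: {nonlinearity}")

def kaiming_normal_std(fan, nonlinearity, a=0.0):
    # std of the normal distribution the weights are drawn from
    return kaiming_gain(nonlinearity, a) / math.sqrt(fan)

if __name__ == "__main__":
    fan_in = 3 * 3 * 64  # e.g. a 3x3 conv with 64 input channels
    print("relu            ", kaiming_normal_std(fan_in, "relu"))
    print("leaky_relu a=0  ", kaiming_normal_std(fan_in, "leaky_relu", a=0.0))
    print("leaky_relu a=.01", kaiming_normal_std(fan_in, "leaky_relu", a=0.01))
```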


Interesting find!

I wondered whether this “bug” was there in the “imagenet in 18 minutes” code, but this is what I found:

nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu') [1]

(we should look into ‘fan_out’ as well…)

So it seems that in our tests ReLU was doing artificially worse because of the wrong init. And yes there is no implementation for Mish, but it might be closer to leaky ReLU…


That default nonlinearity seems to be used all over the fastai repo… We should really investigate how much of a difference that makes.
Or maybe Jeremy already looked into it.