How we beat the 5 epoch ImageWoof leaderboard score - some new techniques to consider

You can still use the LR finder to at least get an idea of the landscape. I just ran the LR finder, picked that value, then also tested 10x above and below it to get a feel for things, and honed in from there.
In general, it seemed to prefer a slightly lower LR.
Re: the ending LR - we just let it descend to near zero, but it's possible that flattening out sooner could be beneficial. That's an area to test further, for sure.
Lastly, AutoOpt is being worked on now and may (may) solve the whole lr and momentum aspect automatically and optimally. Let’s see where it is in the next week.
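For anyone who wants to reproduce that workflow, here is a minimal sketch (fastai v1, using the flattenAnneal helper from the notebook discussed in this thread; the 4e-3 base value is only an example read off the plot):

```python
# Run the LR finder and read a base value off the loss-vs-lr curve.
learn.lr_find()
learn.recorder.plot()

# Probe one order of magnitude above and below the chosen base value.
for lr in (4e-4, 4e-3, 4e-2):
    # Re-create `learn` with fresh weights before each probe run (not shown),
    # then train with the flat-then-anneal schedule for the 5-epoch budget.
    flattenAnneal(learn, lr, 5, 0.72)
```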


Got it. Thanks.

Just a note on how applicable this stuff is: on some tabular research I'm doing, I improved my score by roughly 3-4% in relative terms (we're well above 90% here), with statistical significance!

I intend to run this on Rossmann today as well and will make a separate post for those results, since they relate to tabular data.

One note: I've found that 4e-3 is still a good enough LR over there too.

Interestingly enough, some datasets will see an increased accuracy, whereas others won't. On one dataset I got roughly a little below Jeremy's exp_rmspe score, but in my research project I achieved that statistically significant improvement. One difference is that the research project was neither regression nor binary classification - it was multi-class. I'll have to look more into this behavior and whether it's limited to the two datasets I used, or whether I can repeat it on others.

Do note: by "won't" I mean I achieved the same accuracy as without it.


How does this compare to using LAMB as an optimizer instead?

Also, if I understand correctly, Lookahead can be used with any optimizer. So has it been tried with LAMB? It seems you guys tried it with LARS, which I understand is somewhat similar, so I would expect good results with LAMB as well, no?
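For context, the Lookahead mechanism really is optimizer-agnostic. A minimal sketch of the idea (simplified from the paper, not the Ranger implementation; LAMB itself is not in torch core, so you'd wrap whatever LAMB implementation you're using):

```python
import torch

class Lookahead:
    """Minimal sketch: wrap any torch optimizer, keep a copy of "slow" weights,
    and every k steps move them a fraction alpha toward the fast (inner) weights."""
    def __init__(self, base_opt, k=6, alpha=0.5):
        self.opt, self.k, self.alpha = base_opt, k, alpha
        self.param_groups = base_opt.param_groups            # exposed for schedulers
        self.slow = [[p.detach().clone() for p in g['params']]
                     for g in self.param_groups]
        self.counter = 0

    def zero_grad(self):
        self.opt.zero_grad()

    def step(self, closure=None):
        loss = self.opt.step(closure)                        # normal inner-optimizer step
        self.counter += 1
        if self.counter % self.k == 0:                       # synchronization step
            for group, slow_group in zip(self.param_groups, self.slow):
                for p, slow_p in zip(group['params'], slow_group):
                    slow_p += self.alpha * (p.detach() - slow_p)  # slow += alpha*(fast - slow)
                    p.data.copy_(slow_p)                          # fast <- slow
        return loss

# e.g. opt = Lookahead(torch.optim.Adam(model.parameters(), lr=3e-3))
# or   opt = Lookahead(SomeLambImplementation(model.parameters(), lr=3e-3))
```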

I don’t believe we’ve tried LAMB yet. I will run some tests and tonight I’ll update with how those went :slight_smile: (I’ll also post in the other forum post as well)


@ilovescience see this post: "Meet Mish: New Activation function, possible successor to ReLU?"


How do you update to the new fastai library to use fit_fc()? I'm using Colab, and the version of fastai I get from !curl -s https://course.fast.ai/setup/colab | bash doesn't seem to have learn.fit_fc().

Secondly, how do RAdam/Ranger/etc. compare to AdamW (or SGD with momentum) when trained for longer? I seem to recall a post on the forums that found these new optimizers actually performed worse when trained for 80 epochs.

You only need to run that setup script for the course-v3 content on Colab (Colab is all I use too).

Follow the command used here:

Run !pip install git+https://github.com/fastai/fastai.git to grab the most recent version

And then restart your instance and you should be importing what you need :slight_smile:
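Put together, the Colab cells look roughly like this (the version check is just a sanity check):

```python
# Cell 1: install the latest dev version, then restart the runtime.
!pip install git+https://github.com/fastai/fastai.git

# Cell 2 (after restarting): confirm the new version is active.
import fastai
print(fastai.__version__)   # learn.fit_fc() should now be available
```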

I have a small improvement on 5 epochs.
I can't reproduce the leaderboard results, so my baseline on Colab, with the same arguments, is:
0.7412, std 0.011771156
[0.746 0.75 0.748 0.744 0.718]
Same, but with act_fn ReLU:
0.75720006 std 0.010007978
[0.744 0.766 0.758 0.77 0.748]
And with LeakyReLU:
0.7576 std 0.0058514797
[0.758 0.756 0.748 0.766 0.76 ]
Results here: https://gist.github.com/ayasyrev/eb91d64b219e36898ea952355f239586
Most important here: when I tested different activations, I got strange results, so I began checking everything.
And I found a bug in the xresnet implementation (so in mxresnet too)!
In the func init_cnn, we init the model with nn.init.kaiming_normal_, but its default argument is nonlinearity='leaky_relu'.
So I changed it to nonlinearity='relu' and got a better result. Same for LeakyReLU.
There is no option in torch for Mish - so maybe there is room for a better result there!
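For reference, a minimal sketch of the change (mirroring the xresnet-style init_cnn, not the exact fastai source, with the nonlinearity made explicit):

```python
import torch.nn as nn

def init_cnn(m, nonlinearity='relu'):
    # nn.init.kaiming_normal_ defaults to nonlinearity='leaky_relu';
    # pass the activation the model actually uses instead of relying on the default.
    if getattr(m, 'bias', None) is not None:
        nn.init.constant_(m.bias, 0)
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight, nonlinearity=nonlinearity)
    for child in m.children():
        init_cnn(child, nonlinearity)
```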


Interesting find!

I wondered whether this “bug” was there in the “imagenet in 18 minutes” code, but this is what I found:

nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu') [1]

(we should look into 'fan_out' as well…)

So it seems that in our tests ReLU was doing artificially worse because of the wrong init. And yes, there is no option for Mish, but it might be closer to leaky ReLU…

[1] https://github.com/cybertronai/imagenet18_old/search?q=nn.init.kaiming_normal_&unscoped_q=nn.init.kaiming_normal_
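One thing worth checking directly (a sketch using standard torch.nn.init calls): the nonlinearity string only changes the gain, and with kaiming_normal_'s default a=0, 'relu' and 'leaky_relu' actually resolve to the same gain - whereas mode='fan_in' vs 'fan_out' does change the resulting std:

```python
import torch
import torch.nn as nn

print(nn.init.calculate_gain('relu'))              # sqrt(2) ~ 1.4142
print(nn.init.calculate_gain('leaky_relu', 0))     # also sqrt(2): kaiming_normal_ passes a=0 by default
print(nn.init.calculate_gain('leaky_relu', 0.01))  # ~ 1.4141 for a 0.01 negative slope

w = torch.empty(64, 32, 3, 3)                      # (out_ch, in_ch, k, k)
nn.init.kaiming_normal_(w, mode='fan_out', nonlinearity='relu')   # imagenet18-style init
print(w.std())    # ~ sqrt(2 / (64*3*3)), since fan_out = out_ch * kernel area
```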


That default nonlinearity seems to be used all over the fastai repo… We should really investigate how much of a difference that makes.
Or maybe Jeremy already looked into it.

I did a quick test and didn’t see a difference between initializations, using Adam and resnet18.

(20 runs each, 5 epochs)
With fix: 0.6275
Without: 0.6304

%run train.py --run 10 --woof 1 --size 128 --bs 64 --mixup 0 --sa 0 --epoch 5 --lr 3e-3 --gpu 0 --opt adam --arch 'xresnet18'

As I understand it, init matters more on short runs and on more complex models - better to try not resnet18 but resnet101 :grinning:

Tried with resnet101, 5 epochs

Fixed init:
[0.64, 0.616, 0.624, 0.612, 0.63]: average 0.6244

Not fixed init:
[0.64, 0.596, 0.624, 0.598, 0.622]: average 0.615

p = 0.36, so I would need more runs to confirm (a quick way to run such a test is sketched below).

I wonder if the influence of init depends on the optimizer (I’m using Adam)

%run train.py --run 1 --woof 1 --size 128 --bs 64 --mixup 0 --sa 0 --epoch 5 --lr 3e-3 --gpu 0 --opt adam --arch 'xresnet101'
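The post above doesn't say which test produced p = 0.36, so purely as an illustration of how one might check it on the five runs per setting (a standard two-sample t-test via scipy; the exact p will depend on the test chosen):

```python
from scipy import stats

fixed_init     = [0.640, 0.616, 0.624, 0.612, 0.630]
not_fixed_init = [0.640, 0.596, 0.624, 0.598, 0.622]

# Welch's t-test (does not assume equal variances); with n=5 per group,
# the p-value comes out well above 0.05, i.e. not significant on its own.
t, p = stats.ttest_ind(fixed_init, not_fixed_init, equal_var=False)
print(t, p)
```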


An interesting new technique is talked about here:

Paper https://arxiv.org/pdf/1910.00762v1.pdf


That does look interesting! Potentially quite useful for people who are looking to speed up training. I looked at their code (linked at the end of the abstract); it's a little convoluted. I'd like to see a simple, clean implementation. I wonder if this could be done with the fastai callback system?
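If anyone wants to play with the idea before a clean callback exists, here is a rough plain-PyTorch sketch of the core mechanism from the paper (not the authors' code, and simplified: the paper selects examples probabilistically based on the loss distribution, while this just keeps the top-k losses per batch):

```python
import torch
import torch.nn.functional as F

def selective_backprop_step(model, opt, xb, yb, keep_frac=0.5):
    """Forward the whole batch cheaply, then backprop only the highest-loss examples."""
    model.eval()
    with torch.no_grad():                                   # ranking pass, no gradients
        losses = F.cross_entropy(model(xb), yb, reduction='none')
    k = max(1, int(keep_frac * xb.size(0)))
    idx = losses.topk(k).indices                            # the "biggest losers"

    model.train()
    opt.zero_grad()
    loss = F.cross_entropy(model(xb[idx]), yb[idx])         # full forward+backward on the subset
    loss.backward()
    opt.step()
    return loss.item()
```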

In your ImageWoofChampionship.ipynb notebook, in the Ranger + SimpleSelfAttention + MXResnet + Flatten Anneal section, I find that if I run flattenAnneal(learn, 4e-3, 5, 0.72) right after creating the learn object, it's as good as your result, but if I run learn.lr_find() right before flattenAnneal(learn, 4e-3, 5, 0.72), the accuracy is much worse, every time. Can you have a look or explain that?

@MichaelScofield There was a bit of a bug with the implementation (looks like it hasn't been fixed yet?) where running an lr_find would actually update the optimizer's own weights, resulting in the issue you're experiencing.

@muellerzr Right - because if I reinitialize the Learner instance, the issue is gone. I had the same thought, that calling lr_find actually updates the weights. So the bug is in the Ranger optimizer?

Correct! And the reason behind it is how Ranger uses Lookahead. I'm unsure what the fix is for this, @LessW2020 (if you know?)
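A simple way to sidestep it for now (a sketch, not an official fix; create_model() and opt_func are stand-ins for however you built the MXResNet and Ranger in the notebook):

```python
# Probe the LR on a throwaway Learner...
learn = Learner(data, create_model(), opt_func=opt_func, metrics=accuracy)
learn.lr_find()
learn.recorder.plot()

# ...then rebuild the Learner so the model weights and the Lookahead "slow"
# weights inside Ranger are both fresh before the real run.
learn = Learner(data, create_model(), opt_func=opt_func, metrics=accuracy)
flattenAnneal(learn, 4e-3, 5, 0.72)
```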
