How we beat the 5 epoch ImageWoof leaderboard score - some new techniques to consider

How does this compare to using LAMB as an optimizer instead?

Also, if I understand correctly, Lookahead can be used with any optimizer. So has it been tried with LAMB? It seems you guys tried it with LARS, which I understand is somewhat similar, so I would expect good results with LAMB as well, no?

I don’t believe we’ve tried LAMB yet. I will run some tests and tonight I’ll update with how those went :slight_smile: (I’ll also post in the other forum post as well)
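
For context on the Lookahead question above: since Lookahead is just a wrapper around an inner ("fast") optimizer, it should wrap LAMB the same way it wraps LARS or Adam. Here is a rough sketch of the wrapping pattern (not the official implementation; Adam stands in for LAMB below, since LAMB isn't in torch.optim):

    import torch

    class Lookahead:
        """Minimal Lookahead sketch: run the inner ("fast") optimizer as usual,
        and every k steps pull a set of "slow" weights a fraction alpha toward
        the fast weights, then copy the slow weights back into the model."""
        def __init__(self, base_optimizer, alpha=0.5, k=6):
            self.optimizer = base_optimizer
            self.alpha, self.k, self._step = alpha, k, 0
            # one slow copy per parameter, in the same order as param_groups
            self.slow = [p.detach().clone()
                         for group in base_optimizer.param_groups
                         for p in group['params']]

        def zero_grad(self):
            self.optimizer.zero_grad()

        def step(self):
            loss = self.optimizer.step()
            self._step += 1
            if self._step % self.k == 0:
                fast = [p for group in self.optimizer.param_groups
                        for p in group['params']]
                for slow_p, fast_p in zip(self.slow, fast):
                    slow_p += self.alpha * (fast_p.detach() - slow_p)
                    fast_p.data.copy_(slow_p)
            return loss

    # Any inner optimizer slots in the same way; a LAMB implementation would
    # simply replace Adam below.
    model = torch.nn.Linear(10, 2)
    opt = Lookahead(torch.optim.Adam(model.parameters(), lr=1e-3), alpha=0.5, k=6)

The inner optimizer is entirely interchangeable, which is why trying LAMB should be a drop-in experiment.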


@ilovescience See this post: Meet Mish: New Activation function, possible successor to ReLU?


How do you update to the new fastai library to use fit_fc()? I’m using Colab, and the version of fastai installed by !curl -s https://course.fast.ai/setup/colab | bash doesn’t seem to have learn.fit_fc().

Secondly, how do RAdam/Ranger/etc. compare to AdamW (or SGD with momentum) when trained for longer? I seem to recall a post on the forums that found these new optimizers actually performed worse when trained for 80 epochs.

You only need that script for the course-v3 setup on Colab (Colab is all I use as well).

Follow the command used here:

Run !pip install git+https://github.com/fastai/fastai.git to grab the most recent version

And then restart your instance and you should be importing what you need :slight_smile:
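
Once that’s installed, a rough sketch of what using fit_fc can look like (the data/model setup below is purely illustrative, and the exact module paths and fit_fc argument names may differ slightly between fastai versions):

    from fastai.vision import *

    # illustrative ImageWoof setup; any ImageDataBunch works the same way
    path = untar_data(URLs.IMAGEWOOF)
    data = (ImageList.from_folder(path).split_by_folder(valid='val')
            .label_from_folder().transform(get_transforms(), size=128)
            .databunch(bs=64).normalize(imagenet_stats))

    learn = Learner(data, models.xresnet50(c_out=data.c), metrics=accuracy)
    learn.fit_fc(5, 4e-3)  # flat LR for most of training, then a cosine anneal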

I have a small improvement on 5 epochs.
I can’t reproduce the results from the leaderboard, so my baseline on Colab, with the same arguments, is:
0.7412, std 0.011771156
[0.746 0.75 0.748 0.744 0.718]
Same, but with act_fn ReLU:
0.75720006, std 0.010007978
[0.744 0.766 0.758 0.77 0.748]
And with LeakyReLU:
0.7576, std 0.0058514797
[0.758 0.756 0.748 0.766 0.76 ]
Results here: https://gist.github.com/ayasyrev/eb91d64b219e36898ea952355f239586
Most importantly: when I tested different activations I got strange results, so I began checking everything.
And I found a bug in the xresnet implementation (and so in mxresnet too)!
In the init_cnn function we initialize the model with nn.init.kaiming_normal_, but its default argument is nonlinearity='leaky_relu'.
So I changed it to nonlinearity='relu' and got a better result. Same for LeakyReLU.
There is no implementation in torch for Mish, so maybe there is room for an even better result there!
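
For anyone who wants to try it, here is a rough sketch of the fix, modeled on fastai’s init_cnn (the nonlinearity argument is the change):

    import torch.nn as nn

    def init_cnn(m, nonlinearity='relu'):
        # kaiming_normal_ defaults to nonlinearity='leaky_relu'; pass the
        # activation the network actually uses so the init gain matches
        if getattr(m, 'bias', None) is not None:
            nn.init.constant_(m.bias, 0)
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.kaiming_normal_(m.weight, nonlinearity=nonlinearity)
        for child in m.children():
            init_cnn(child, nonlinearity)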


Interesting find!

I wondered whether this “bug” was there in the “imagenet in 18 minutes” code, but this is what I found:

nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu') [1]

(we should look into ‘fan_out’ as well…)

So it seems that in our tests ReLU was doing artificially worse because of the wrong init. And yes there is no implementation for Mish, but it might be closer to leaky ReLU…

[1] https://github.com/cybertronai/imagenet18_old/search?q=nn.init.kaiming_normal_&unscoped_q=nn.init.kaiming_normal_


That default nonlinearity seems to be used all over the fastai repo… We should really investigate how much of a difference that makes.
Or maybe Jeremy already looked into it.

I did a quick test and didn’t see a difference between initializations, using Adam and resnet18.

(20 runs each, 5 epochs)
With fix: 0.6275
Without: 0.6304

%run train.py --run 10 --woof 1 --size 128 --bs 64 --mixup 0 --sa 0 --epoch 5 --lr 3e-3 --gpu 0
--opt adam --arch 'xresnet18'

As I understand it, init matters more on short runs and on more complex models, so better to try not resnet18 but resnet101 :grinning:

Tried with resnet101, 5 epochs

Fixed init:
[0.64, 0.616, 0.624, 0.612, 0.63]: average 0.6244

Not fixed init:
[0.64, 0.596, 0.624, 0.598, 0.622]: average 0.615

p = 0.36, so we would need more runs to confirm

I wonder if the influence of init depends on the optimizer (I’m using Adam)

%run train.py --run 1 --woof 1 --size 128 --bs 64 --mixup 0 --sa 0 --epoch 5 --lr 3e-3 --gpu 0
--opt adam --arch 'xresnet101'
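
A quick sketch of how a p-value like the 0.36 above can be computed, using a two-sample t-test over the per-run accuracies (with only 5 runs per group, any such test is underpowered):

    from scipy import stats

    fixed_init   = [0.640, 0.616, 0.624, 0.612, 0.630]
    default_init = [0.640, 0.596, 0.624, 0.598, 0.622]

    # Welch's t-test (unequal variances); prints the two-sided p-value
    t, p = stats.ttest_ind(fixed_init, default_init, equal_var=False)
    print(f"t = {t:.2f}, p = {p:.2f}")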


An interesting new technique is discussed here:

Paper https://arxiv.org/pdf/1910.00762v1.pdf


That does look interesting! Potentially quite useful for people who are looking to speed up training. I looked at their code (linked at the end of the abstract); it’s a little convoluted. I’d like to see a simple, clean implementation. I wonder if this could be done with the fastai callback system?
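
For what it’s worth, the core idea is simple enough to sketch in plain PyTorch before worrying about the callback plumbing. A simplified, illustrative version: forward the whole batch, but only backpropagate the highest-loss examples (the paper actually samples examples by their loss percentile rather than taking a hard top-k, so treat this as an approximation):

    import torch.nn.functional as F

    def selective_backprop_step(model, opt, xb, yb, keep_frac=0.5):
        """Forward the full batch, backprop only the hardest keep_frac of it."""
        opt.zero_grad()
        logits = model(xb)
        per_example = F.cross_entropy(logits, yb, reduction='none')
        k = max(1, int(keep_frac * xb.size(0)))
        _, keep = per_example.topk(k)        # indices of the largest losses
        loss = per_example[keep].mean()
        loss.backward()
        opt.step()
        return loss.item()

Wrapping this in a fastai callback would mostly be a matter of intercepting the loss before the backward pass.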

In your ImageWoofChampionship.ipynb notebook, in the Ranger + SimpleSelfAttention + MXResnet + Flatten Anneal section, I find that if I run flattenAnneal(learn, 4e-3, 5, 0.72) right after creating the learn object, it’s as good as your result, but if I run learn.lr_find() right before flattenAnneal(learn, 4e-3, 5, 0.72), the accuracy is much worse, every time. Can you have a look or explain that?

@MichaelScofield There was a bit of a bug with the implementation (looks like it hasn’t been fixed yet?) where running lr_find would actually update the weights for the optimizer itself, resulting in the issue you’re experiencing.

@muellerzr Right, because if I reinitialize the Learner instance, the issue is gone. I suspected the same, that calling lr_find actually updates the weights. So the bug is in the Ranger optimizer?

Correct! And the reason is how Ranger uses Lookahead. I’m unsure what the fix is for this, @LessW2020 (do you know?)
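
Until there’s a proper fix, a simple workaround sketch based on what was observed above: rebuild the Learner after lr_find so Lookahead’s slow weights aren’t carried into the real run (data, opt_func, mxresnet50 and flattenAnneal are the names from the notebook and are assumed to already be defined):

    learn = Learner(data, mxresnet50(), opt_func=opt_func, metrics=accuracy)
    learn.lr_find()
    learn.recorder.plot()

    # re-create the Learner (and therefore the optimizer state) before training
    learn = Learner(data, mxresnet50(), opt_func=opt_func, metrics=accuracy)
    flattenAnneal(learn, 4e-3, 5, 0.72)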


Hi @LessW2020, another question: is there any reason for normalizing the data with imagenet_stats when your models are not finetuned from pretrained imagenet weights? Isn’t it more reasonable to normalize with the dataset’s own mean and std?

Ideally, yes, you’re right there. I know Misra is working on getting more extensive training done, so something like a pretrained mxresnet can become a thing.


You don’t normalise with the stats of the dataset the model was pretrained on, you normalize with the stats of the data you are using. So a non-pretrained model using imagenet data should still be normalised with imagenet stats. If you are using non-imagenet data with imagenet weights then you should normalise with the stats of your data not imagenet stats.
Think about what normalising is trying to achieve. It aims to create model inputs with a mean of 0 and an std of 1 (Z-scores as they are termed in statistics). So if the model was pretrained on imagenet data normalised with imagenet stats and you use your own data normalised with your own stats then in both cases the inputs will have a mean of 0 and std of 1. But normalising your own data with imagenet stats because that’s what the model was pretrained on would not achieve this.

So you are right that the data should be normalised using the stats of the input data, but not because the model isn’t pretrained; you’d do the same with a pretrained imagenet model. However, here the dataset used, ImageWoof, is a subset of imagenet, so using the imagenet stats rather than calculating subset-specific stats seems pretty reasonable. That’s why the fastai imagenette script uses imagenet stats in spite of not generally being intended for pretrained models.
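
To make the point concrete: normalization aims for roughly zero-mean, unit-std inputs, so the stats should describe the data actually fed to the model. A small sketch of computing per-channel stats from a (hypothetical) tensor of training images shaped (N, C, H, W); for ImageWoof, being an ImageNet subset, these come out close to the usual imagenet_stats anyway:

    import torch

    def channel_stats(images: torch.Tensor):
        """Per-channel mean/std for a batch of images shaped (N, C, H, W)."""
        flat = images.permute(1, 0, 2, 3).reshape(images.size(1), -1)
        return flat.mean(dim=1), flat.std(dim=1)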
