How we beat the 5 epoch ImageWoof leaderboard score - some new techniques to consider

That default nonlinearity seems to be used all over the fastai repo… We should really investigate how much of a difference that makes.
Or maybe Jeremy already looked into it.

I did a quick test and didn’t see a difference between initializations, using Adam and resnet18.

(20 runs each, 5 epochs)
With fix: 0.6275
Without: 0.6304

%run train.py --run 10 --woof 1 --size 128 --bs 64 --mixup 0 --sa 0 --epoch 5 --lr 3e-3 --gpu 0 --opt adam --arch 'xresnet18'
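
For context, a minimal sketch of the kind of change being compared, assuming the "fix" simply means passing the network's actual nonlinearity to the kaiming init rather than relying on PyTorch's default of 'leaky_relu' (the helper name is made up):

import torch.nn as nn

def init_cnn_(m, nonlinearity='relu'):
    # Hypothetical helper: re-initialize conv/linear layers with an explicit
    # nonlinearity instead of kaiming_normal_'s default ('leaky_relu').
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight, nonlinearity=nonlinearity)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# usage: model.apply(init_cnn_)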

As I understand it, initialization matters more on short runs and with more complex models, so better to try not resnet18 but resnet101 :grinning:

Tried with resnet101, 5 epochs

Fixed init:
[0.64 , 0.616, 0.624, 0.612, 0.63]: average 0.6244

Not fixed init:
[0.64, 0.596, 0.624, 0.598, 0.622]: average 0.615

p = 0.36, so we'd need more runs to confirm.

I wonder if the influence of init depends on the optimizer (I’m using Adam)

%run train.py --run 1 --woof 1 --size 128 --bs 64 --mixup 0 --sa 0 --epoch 5 --lr 3e-3 --gpu 0 --opt adam --arch 'xresnet101'


An interesting new technique is discussed here:

Paper https://arxiv.org/pdf/1910.00762v1.pdf


That does look interesting! Potentially quite useful for people who are looking to speed up training. I looked at their code (linked at the end of the abstract); it's a little convoluted. I'd like to see a simple, clean implementation. I wonder if this could be done with the fastai callback system?
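
For anyone curious, here is a rough sketch of the core idea in plain PyTorch, assuming I'm reading the paper right (skip the expensive backward pass for examples the model already handles easily). The function name and the simple top-k selection are mine; the paper uses a probabilistic selection based on each example's loss history, and this isn't hooked into fastai's callback system:

import torch
import torch.nn.functional as F

def selective_backprop_step(model, opt, xb, yb, keep_frac=0.5):
    # Cheap, gradient-free pass to score every example by its loss.
    with torch.no_grad():
        losses = F.cross_entropy(model(xb), yb, reduction='none')

    # Keep only the hardest (highest-loss) examples for the real forward/backward.
    k = max(1, int(keep_frac * xb.size(0)))
    idx = losses.topk(k).indices
    xb_sel, yb_sel = xb[idx], yb[idx]

    opt.zero_grad()
    loss = F.cross_entropy(model(xb_sel), yb_sel)
    loss.backward()
    opt.step()
    return loss.item()

A fastai callback could presumably do something similar by filtering the batch in on_batch_begin, but I haven't tried it.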

In your ImageWoofChampionship.ipynb notebook, in the Ranger + SimpleSelfAttention + MXResnet + Flatten Anneal section, I find that if I run flattenAnneal(learn, 4e-3, 5, 0.72) right after creating the learn object, the result is as good as yours, but if I run learn.lr_find() right before flattenAnneal(learn, 4e-3, 5, 0.72), the accuracy is much worse, every time. Could you have a look or explain that?

@MichaelScofield There was a bit of a bug with the implementation (looks like it hasn't been fixed yet?) where running lr_find would actually update the weights stored by the optimizer itself, resulting in the issue you're experiencing.

@muellerzr Right, because if I reinitialize the Learner instance, the issue is gone. I suspected the same, that calling lr_find actually updates the weights, so is the bug in the Ranger optimizer?

Correct! And the reason behind it is how Ranger uses LookAhead. I'm unsure what the fix for this is, @LessW2020 (do you know?)


Hi @LessW2020, another question, is there any reason for normalizing the data with imagenet_stats when your models are not finetuned from pretrained imagenet weights? Isn’t it more reasonable to normalize with the dataset’s own mean and std?

Ideally, yes, you are right there. I know we (Misra) are working on getting more extensive training done, so something like a pretrained mxresnet can be a thing.


You don't normalise with the stats of the dataset the model was pretrained on, you normalise with the stats of the data you are using. So a non-pretrained model using imagenet data should still be normalised with imagenet stats. If you are using non-imagenet data with imagenet weights then you should normalise with the stats of your data, not imagenet stats.
Think about what normalising is trying to achieve. It aims to create model inputs with a mean of 0 and an std of 1 (Z-scores as they are termed in statistics). So if the model was pretrained on imagenet data normalised with imagenet stats and you use your own data normalised with your own stats then in both cases the inputs will have a mean of 0 and std of 1. But normalising your own data with imagenet stats because that’s what the model was pretrained on would not achieve this.

So you are right that the model inputs should be normalised using the stats of the input data, but not because it's a non-pretrained model; you'd do the same with a pretrained imagenet model. However, here the dataset used, ImageWoof, is a subset of imagenet, so using the imagenet stats rather than calculating subset-specific stats seems pretty reasonable. That's why the fastai imagenette script uses imagenet stats in spite of not generally being intended for pretrained models.
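
To make the distinction concrete, here is a small sketch (plain PyTorch, illustrative helper names) of computing a dataset's own per-channel stats and z-scoring with them, versus reusing the imagenet stats:

import torch

def channel_stats(batch):
    # Per-channel mean/std from a batch of images shaped [N, C, H, W],
    # with pixel values already scaled to [0, 1].
    return batch.mean(dim=[0, 2, 3]), batch.std(dim=[0, 2, 3])

def normalise(batch, mean, std):
    # Z-score each channel so the result has roughly mean 0 and std 1.
    return (batch - mean[None, :, None, None]) / std[None, :, None, None]

imagenet_mean = torch.tensor([0.485, 0.456, 0.406])
imagenet_std  = torch.tensor([0.229, 0.224, 0.225])

# normalise(batch, *channel_stats(batch))        -> your dataset's own stats
# normalise(batch, imagenet_mean, imagenet_std)  -> imagenet stats

Either way, the point is that the inputs the model actually sees end up close to mean 0, std 1.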


Is this opposed to what Jeremy said here and in the docs, that if using a pretrained model one needs to use the same stats it was trained with?


Interesting. Of course, maybe I'm wrong. My reply seems to make sense to me: standardise your ranges. On the other hand, Jeremy's replies don't really make sense to me. From that thread:

For example, imagine you’re trying to classify different types of green frogs. If you were to use your own per-channel means from your dataset, you would end up converting them to a mean of zero, a standard deviation of one for each of your red, green, and blue channels. Which means they don’t look like green frogs anymore.

But isn't that exactly the same as applying the imagenet stats to the Imagenette images, as the train_imagenette.py script does? Presumably the imagenette images have about the same stats as imagenet, so they are going to end up standardised to mean 0, std 1. And in some of the recent material Jeremy talked about the importance of ensuring your model inputs have mean 0 and std 1 in order to get off to a good start. Won't suddenly throwing in data with quite different means/stds similarly throw it off?

Also, I've seen stuff around that, with non-pretrained models, divides by the mean/std of the data itself. Under the logic above, wouldn't that remove the uniqueness?

Also, what exactly would normalising your data with the imagenet mean and std achieve? How does that make your data more like imagenet data in a meaningful way?

(To be doubly clear, not really challenging here, just struggling to understand)


Hi all - sorry I've been so out of touch, but I'm in the middle of a long stretch of work unrelated to deep learning.

Just to touch base and answer this issue:
@muellerzr is correct - what's happening is that when you run the lr_finder, it loops the network through a wild series of learning rates. In the background, Ranger is storing away every 5th step's weights (one buffer, updated every 5 steps).
When you are done with lr_finder, you start your regular training…but now Ranger is running with its lookahead weights that were set at some crazy high lr from the end of lr_finder.
Those crazy weights are then blended in during your actual training and that of course drags down the training.
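
For readers unfamiliar with LookAhead, here is a stripped-down sketch of the slow-weight update (illustrative only, not Ranger's actual code) showing why a buffer left over from lr_find is a problem:

import torch

def lookahead_update(slow_weights, params, step, alpha=0.5, k=5):
    # Minimal LookAhead step: every k updates, pull the stored "slow" weights
    # toward the model's "fast" weights and reset the fast weights to the blend.
    if step % k != 0:
        return
    with torch.no_grad():
        for slow, fast in zip(slow_weights, params):
            slow += alpha * (fast - slow)
            fast.copy_(slow)

# If slow_weights were filled in during an lr_find sweep at huge learning rates,
# this blend keeps dragging the subsequent real training run toward those stale weights.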

The fix here is that I need to add a Start_New_Training=1 or similar flag, or possibly a "Not_Training" flag (so it doesn't use the memory buffer during lr_find), either of which will allow Ranger to be smart about how to deal with its in-memory weight buffer.

Anyway, right now Ranger has no idea that the earlier runs were from an lr_finder vs. actual training, hence the confusion. If you blow away the learner and start fresh that of course works around it, but since a lot of people use lr_finder, I think the fix is to either auto-detect that lr_finder is active or else put in a control property to tell Ranger what to do.

On that note though, I have a RangerQH which I really like using for production work (i.e. 80+ epoch type training), and hopefully I'll have time to get back to it in the future and will try to update both Rangers with an appropriate flag to make them more context aware. :slight_smile:

*Edit - actually the newer version of Ranger (9-13-19) likely won't have this issue, as the weights are embedded into a dictionary and saved with the model (the previous version kept them in a separate in-memory buffer). Thus when the lr_finder resets the learner, it should restore Ranger with its starting weights…but it may also be an issue that the lr_finder code has no concept that an optimizer even has a state/buffer to restore, so it may just ignore that entirely.

LRFinder will call a reset() method, if present, on any callbacks, presumably for just such a purpose, so you could just implement a reset().

Though it does also call learn.load() to load the state it saved at the start, so it sounds like the new version shouldn't have the issue (assuming the state was initialised in time to be saved; it's saved in on_train_begin).
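
If that's the mechanism, a hedged sketch of a workaround (fastai v1 style, hypothetical class name, and assuming learn.opt wraps the underlying torch optimizer as learn.opt.opt) would be a callback whose reset() clears Ranger's stored buffers:

from fastai.basic_train import LearnerCallback

class RangerResetCallback(LearnerCallback):
    # Hypothetical callback: drop the optimizer's stored LookAhead buffers so
    # stale slow weights from an lr_find sweep don't leak into real training.
    def reset(self):
        opt = getattr(self.learn.opt, 'opt', self.learn.opt)
        if hasattr(opt, 'state'):
            opt.state.clear()   # buffers are re-created lazily on the next step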


In the Image{Nette/Woof} training scripts, I believe you don't fit for more than 20 epochs.

If start_pct = 0.72, that means the learning rate remains flat until 0.72 * 20 ≈ epoch 14, correct? I'm curious to know if this is too long, or whether there is any "good" way to determine a good start_pct.
More importantly, if one were training a model for 100 epochs, would it be wiser to use start_pct = 0.14 (intuitively, 14 epochs feels like a long enough time to find a good loss space) or stick with start_pct = 0.72?

cc: @LessW2020 @muellerzr @grankin

@rsomani95 It works by batches, not by epoch numbers, so it's 72% of your total batches. Hence why we can still use 72% over 5 epochs.

You can see this here, where n is the number of batches per epoch:

n = len(learn.data.train_dl)                     # batches per epoch
anneal_start = int(n*num_epoch*annealing_start)  # batch at which annealing begins
phase0 = TrainingPhase(anneal_start).schedule_hp('lr', lr)  # flat lr up to anneal_start
phase1 = TrainingPhase(n*num_epoch - anneal_start).schedule_hp('lr', lr, anneal=annealing_cos)  # cosine anneal for the rest
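
For example, with a hypothetical 100 batches per epoch and 5 epochs, anneal_start = int(100 * 5 * 0.72) = 360, so the learning rate stays flat for the first 360 of the 500 batches and is cosine-annealed over the remaining 140.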

Ah, I see. Thanks for the clarification.