ImageNette/Woof Leaderboards - guidelines for proving new high scores?

Sometimes I wonder if arxiv is full of results that haven’t been tested for statistical significance, because who would run a model on Imagenet more than once? But I reassure myself by assuming that variance must be much smaller when using the Imagenet test set.

I think that sounds really good - it needs to be both simple and robust.
So basic guidelines:
* 5 runs total, and the best result from each set (i.e. 20 epochs x 5) for the final average.
* I also agree that we should be looking for 1%+ type improvements… a 0.3% jump is not interesting.
* GPU count is not a factor other than reporting how it was run - the best score is the best score whether on 1 GPU, 4, 8, etc.


Agree.

Completely disagree :slight_smile: It's not an entrant's job to go back and check whether the current leader really optimized their architecture/lr. I have to assume, and will always assume, that whoever the current leaderboard holder is - Jeremy, you, or anyone else - they used proper LR selection.
If the leader didn't, then someone simply picking a better LR through whatever means is by rights the new leaderboard holder, b/c they showed a better result.
Example - if the current record gets beaten by someone showing that some crazy high rate like 1e-1 works great and trounces the old 3e-3… well, congrats, you made the discovery about what lr works better and you get the new high score. It's good info for everyone to see that crazy high rates work great on this architecture.
That's an example, but hopefully it makes my point that it's not at all the entrant's job to go back and reprove that the current holder optimized their architecture/lr.
Now, that said, personally I'm after testing new things like better activation functions and new optimizers to compare against the leaderboard, so I'm not likely to be putzing around with lr and claiming a new score, but if a better lr works better, then it's a new entry regardless, imo.

Yes for sure and thanks for all your feedback above. I think the leaderboards are a great way to help test new things and see if it’s really making progress or not.
I have tested so many things from papers and very few (two actually) have shown better scores. I think a lot of papers don’t really hold up on unseen datasets so ImageNette and the leaderboard serve as a great proving ground for testing out new ideas.

Yeah, I agree that ideally entrants wouldn't have to retest the baseline. However, at this point it seems necessary if we want to draw any conclusions about a new idea.

I’ll probably be satisfied once we rework the baselines, and then we can assume that new entrants have picked the best parameters for their own entry.

Edit to add: also, although it might make the tables too big, I’d like to see more than just the best entry in the leaderboard. Maybe there’s a good way to do that.


Right - if someone gets a clearly better result, I want to show that result on the leaderboard, along with the details of how they got it! :slight_smile:


I finally re-figured out what the issue with the current baseline is. I will detail it here, and I think you will see why I’ve been telling people to be cautious when comparing new ideas to the leaderboard.

The baseline runs the code in train_imagenette.py [1].

Check this part of the code:

bs_rat = bs/256
if gpu is not None: bs_rat *= num_distrib()
if not gpu: print(f'lr: {lr}; eff_lr: {lr*bs_rat}; size: {size}; alpha: {alpha}; mom: {mom}; eps: {eps}')
lr *= bs_rat

When I run train_imagenette on 1 GPU, with bs = 64, my learning rate gets divided by 4! My understanding is that, with 4 GPUs, the learning rate stays the same but we would want to increase it.
I think this is a relic of the hardcoded bs of 256 in train_imagenet.py [2]…

Let’s compare some results between using intended lr/4 and intended lr:

| Dataset | Epochs | Size | Accuracy | Params | GPUs |
| --- | --- | --- | --- | --- | --- |
| Imagenette | 5 | 128 | 85.36% [4] | %run train_imagenette.py --epochs 5 --bs 64 --lr 12e-3 --mixup 0 | 1 |
| Imagenette | 5 | 128 | 82.9% [3] | %run train_imagenette.py --epochs 5 --bs 64 --lr 3e-3 --mixup 0 | 1 |

The first line has a learning rate of 12e-3 but an effective lr of 3e-3; the second line has lr = 3e-3 and an effective lr of 0.00075.
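
To make the arithmetic explicit, here is a minimal sketch of what that scaling does with the settings in the table (bs = 64, single GPU); it reproduces the effective LRs above:

# the scaling from the snippet above, with bs=64 on a single GPU
bs, bs_rat = 64, 64 / 256          # bs_rat = 0.25
for lr in (12e-3, 3e-3):
    print(f"lr passed in: {lr}; effective lr: {lr * bs_rat}")
# lr passed in: 0.012; effective lr: 0.003
# lr passed in: 0.003; effective lr: 0.00075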

[1] https://github.com/fastai/fastai/blob/master/examples/train_imagenette.py
[2] https://github.com/fastai/fastai/blob/master/examples/train_imagenet.py
[3] np.mean([83.8,83.8,81.8,82.4,81.8,85,85,80.4,83,82])
[4] np.mean([86,85.4,84.4,85.2,84.8,85,85.6,85.4,85.4,86.4])


Nice work @Seb!
I couldn’t imagine how the number of GPUs would affect results, but it makes sense that the LR was not adjusted properly.


Regarding # of GPUs, IIRC, if we add more GPUs, we can increase the BS, and thus increase the LR.
It’s more of a rule of thumb, so results may vary, and we might not want to add this extra variable if we are testing an idea against the baseline. It depends on what your goal is.

Increasing the # of GPUs effectively increases the batch size as well, and the learning rate is meant to (roughly) scale with batch size. That’s why that line of code is there. It would certainly be interesting to hear of examples where it doesn’t work well.

I suggest replacing (in train_imagenette.py) the hardcoded 256 divisor with the input BS so that eff lr = lr on 1 GPU. Then it will take some work to redo the baseline with the intended lr (and ideally a sample size > 1)…
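
Something along these lines is what I mean (a sketch of the change, reusing the variables from the snippet quoted above; not an actual patch):

# divide by the batch size actually passed in, instead of the hardcoded 256,
# so that eff lr == lr on a single GPU; multi-GPU runs still scale the LR up
bs_rat = bs / bs                               # always 1.0 for a single process
if gpu is not None: bs_rat *= num_distrib()    # unchanged from the original snippet
lr *= bs_rat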

It’s fairly common to specify the LR @ a BS of 256 (or some other k) and then scale according to current runtime capabilities. In most cases it makes the results more consistent for those not fully aware of what’s going on. If you do that scaling for the user, it’s helpful to point out in the comments/help text that the LR is specified @ BS 256. I do prefer to specify and calculate the effective LR myself, though.
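
If you want to do that calculation yourself, the usual linear-scaling rule looks roughly like this (a sketch; `base_lr_at_256` is a hypothetical name for an LR quoted at BS 256):

def effective_lr(base_lr_at_256, bs, n_gpu=1, ref_bs=256):
    # linear scaling: grow the LR in proportion to the total batch size actually used
    return base_lr_at_256 * (bs * n_gpu) / ref_bs

print(effective_lr(3e-3, bs=64, n_gpu=1))   # 0.00075 -> what the current script trains with
print(effective_lr(3e-3, bs=64, n_gpu=4))   # 0.003   -> back at the quoted LR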

There are numerous other reasons why comparing results of 4 GPU vs 1 GPU training can be problematic…

Batch norm is probably the biggest one. Without synchronized batch norm, you’re using BN stats from one of the N GPUs. This isn’t necessarily a bad thing, sometimes it can be a benefit when BS is big, but it is a significant change from the single GPU case. Even if you enable synchronized BN, the synchronized stats end up a little different. I feel the performance hit of sync_bn is not worth it until you’re in the really small batch size realm.
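
For reference, in plain PyTorch (outside fastai) switching a model over to synchronized BN for DDP is typically done like this; a minimal sketch:

import torch.nn as nn

# toy model with ordinary BatchNorm layers
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# replace every BN layer with SyncBatchNorm, which reduces batch statistics
# across all processes; this only has an effect when running under DDP
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)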

Validation. Typically, if the validation is also being done on N GPUs with a distributed sampler, it will not be quite correct unless your dataset size % N == 0. Extra samples are inserted, and most implementations don’t bother to (or can’t easily) remove their impact from the resulting reduction at the end. I always re-run validation at the end with 1 GPU or DP instead of DDP for final comparisons.
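
A quick illustration of the padding issue with PyTorch’s DistributedSampler (a sketch with made-up sizes; no process group is needed since the replica count and rank are passed explicitly):

import torch
from torch.utils.data import TensorDataset, DistributedSampler

ds = TensorDataset(torch.arange(10))   # 10 validation samples, 4 "GPUs": 10 % 4 != 0
world = 4
for rank in range(world):
    sampler = DistributedSampler(ds, num_replicas=world, rank=rank, shuffle=False)
    print(rank, list(sampler))
# each rank gets ceil(10 / 4) = 3 indices, so 12 samples are validated in total
# and two dataset items are counted twice, slightly skewing the averaged metric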


There’s actually some evidence that batchnorm with large batch sizes decreases generalization. See https://arxiv.org/pdf/1705.08741.pdf

That paper suggests using a “ghost” batchnorm that reduces the effective batch size by applying batchnorm to “virtual” minibatches.

There’s also some empirical evidence from Myrtle.ai (who recently trained Cifar10 to 94% accuracy in 34 seconds) that large-batch batchnorm performed worse than ghost batchnorm with an effective batch size of 32.
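
For anyone who wants to try it against the baseline, here is a minimal sketch of the ghost-BN idea (my own simplification, not Myrtle’s or the paper’s code): normalize fixed-size virtual mini-batches with their own statistics.

import torch
import torch.nn as nn

class GhostBatchNorm2d(nn.Module):
    """BatchNorm applied separately to 'virtual' mini-batches of size ghost_bs."""
    def __init__(self, num_features, ghost_bs=32, **bn_kwargs):
        super().__init__()
        self.ghost_bs = ghost_bs
        self.bn = nn.BatchNorm2d(num_features, **bn_kwargs)

    def forward(self, x):
        # during training, split the (possibly large) batch into chunks and
        # normalize each chunk with its own batch statistics; note the running
        # stats are updated once per chunk, which slightly changes their momentum
        if self.training and x.size(0) > self.ghost_bs:
            return torch.cat([self.bn(c) for c in x.split(self.ghost_bs)], dim=0)
        return self.bn(x)   # eval (or already-small batch): plain BatchNorm

x = torch.randn(128, 16, 8, 8)
out = GhostBatchNorm2d(16, ghost_bs=32).train()(x)   # stats computed per chunk of 32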


I believe the correct dataset in this table is Imagenette, since woof=0 by default.


Good catch, I corrected it.


I propose we reconsider using OneCycle. RAdam and Novograd both don’t need warmup, as opposed to Adam. We can utilise this property and introduce a different learning rate policy.

I’ve used flat LR and then cosine annealing.
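
In case it helps others try the same thing, here is a minimal sketch of that schedule in plain PyTorch (not the code from my repo; the flat fraction here is arbitrary):

import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def flat_then_cosine(optimizer, total_steps, flat_frac=0.7):
    """Hold the LR flat for flat_frac of training, then cosine-anneal it to zero."""
    flat_steps = int(total_steps * flat_frac)
    def lr_lambda(step):
        if step < flat_steps:
            return 1.0
        progress = (step - flat_steps) / max(1, total_steps - flat_steps)
        return 0.5 * (1 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)

# usage: call sched.step() once per batch
model = torch.nn.Linear(10, 2)
opt = torch.optim.Adam(model.parameters(), lr=3e-3)
sched = flat_then_cosine(opt, total_steps=5 * 100)   # e.g. 5 epochs x 100 batches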

I did a simple script to run training 20 times at 5 epochs and calculate the mean and std. So far it looks good.

updated.

https://github.com/mgrankin/over9000

Imagenette 128 scored 0.8746 over 20 runs with Over9000. That is 1.69% higher than the LB. Imagewoof did +2.89%.


I think the reason why RAdam/Ranger scored worse than Adam is the LR schedule. This new LR schedule (flat then annealing) was the first thing that came to mind; it might not be great. There should be an LR schedule for RAdam that is as good for it as OneCycle is for Adam.


Great work! Looks like we have a winner.

Can I suggest running Adam with flat and anneal?

A small detail: your xresnet [1] uses channel sizes = [c_in,32,64,64], while the one in the fastai repo [2] has sizes = [c_in,32,32,64]. (IIRC the different version comes from a fastai course notebook, which I used, and then it went to you.)
It actually might make results a tiny bit better (a rough sketch of the stem difference is below the links).

[1] https://github.com/mgrankin/over9000/blob/master/xresnet.py
[2] https://github.com/fastai/fastai/blob/master/fastai/vision/models/xresnet.py
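
To make the difference concrete, those sizes are just the widths of the three stem convs; a rough sketch (with a simplified stand-in for the actual conv_layer helper):

import torch.nn as nn

def conv_bn_relu(ci, co, stride=1):
    # simplified stand-in for the conv+BN+ReLU layer used in xresnet's stem
    return nn.Sequential(nn.Conv2d(ci, co, 3, stride, 1, bias=False),
                         nn.BatchNorm2d(co), nn.ReLU(inplace=True))

def stem(sizes):
    # first conv has stride 2, as in xresnet; sizes[0] is c_in
    return nn.Sequential(*[conv_bn_relu(sizes[i], sizes[i + 1], stride=2 if i == 0 else 1)
                           for i in range(3)])

over9000_stem = stem([3, 32, 64, 64])   # slightly wider middle conv
fastai_stem   = stem([3, 32, 32, 64])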

Oh, I’ll change that to [c_in,32,32,64]. Thank you for pointing this out.

I would leave it as [c_in,32,64,64]; you’ve done all your runs with that version so far, and it seems to do better anyway. I just wanted to make sure that detail doesn’t confuse someone later on.

To be back on the “guidelines” topic, I think @grankin’s github is a good example of what I’d like to see when someone updates the leaderboard: easy to run code, as well as a notebook with organized and detailed results. Obviously, future entrants won’t have to rerun the baseline.

One issue could be if a repo gets updated or deleted after entering the leaderboard. Maybe we need a snapshot fork of the repo and have that be the link. Or have entrants just add their stuff in a folder on the fastai repo.