ImageNette/Woof Leaderboards - guidelines for proving new high scores?

I suggest replacing (in train_imagenette.py) the hardcoded 256 divisor with the input BS, so that the effective LR equals the specified LR for a single-GPU run. It will then take some work to redo the baseline with the intended LR (and ideally a sample size > 1)…

It’s fairly common to specify the LR @ BS of 256 (or some other k) and then scale it according to current runtime capabilities. In most cases this makes results more consistent for those not fully aware of what’s going on. If you do that scaling for the user, it’s helpful to note in comments/help text that the LR is specified @ BS 256. I prefer to specify and calculate the effective LR myself, though.
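To make the convention concrete, here is a minimal sketch of that linear scaling rule (my own helper name, not the script’s actual code): the LR you pass is "per 256 images" and gets scaled by the global batch size.

```python
# Hedged sketch of the "LR @ BS 256" convention discussed above.
def effective_lr(base_lr_at_256, batch_size_per_gpu, num_gpus):
    """Scale a learning rate specified at batch size 256 to the actual global batch size."""
    global_bs = batch_size_per_gpu * num_gpus
    return base_lr_at_256 * global_bs / 256

# e.g. LR given as 3e-3 @ BS 256, training with 64 images per GPU on 4 GPUs:
lr = effective_lr(3e-3, 64, 4)   # -> 3e-3, since 64 * 4 == 256
```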

There are numerous other reasons why comparing results of 4 GPU vs 1 GPU training can be problematic…

Batch norm is probably the biggest one. Without synchronized batch norm, each GPU computes BN statistics over only its own slice of the batch. This isn’t necessarily a bad thing, and it can even be a benefit when the BS is big, but it is a significant change from the single-GPU case. Even if you enable synchronized BN, the synchronized stats end up a little different. I feel the performance hit of sync_bn is not worth it until you’re in the really small batch size realm.
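For reference, a minimal sketch of enabling synchronized BN under PyTorch DDP (not the actual train_imagenette.py code; `build_model` and `local_rank` are placeholders for whatever your launcher provides):

```python
# Hedged sketch: convert every BatchNorm layer so its statistics are reduced
# across all ranks instead of being computed per-GPU.
import torch

model = build_model().cuda(local_rank)                      # placeholder model constructor
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```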

Validation. Typically, if validation is also done on N GPUs with a distributed sampler, it will not be quite correct unless dataset size % N == 0. Extra samples are inserted to pad the batches, and most implementations don’t bother to (or can’t easily) remove their impact from the final reduction. I always re-run validation at the end with 1 GPU or DP instead of DDP for final comparisons.
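A minimal sketch of that final single-GPU re-validation (my own helper, assuming a plain classification setup): with no DistributedSampler there is no padding, so every sample is counted exactly once.

```python
# Hedged sketch: after DDP training, rebuild a plain (non-distributed) loader
# and validate once so no padded duplicates skew the metric.
import torch
from torch.utils.data import DataLoader

def validate_single_gpu(model, val_dataset, batch_size=64, device="cuda"):
    loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)  # no DistributedSampler
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            pred = model(x).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total
```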


There’s actually some evidence that batchnorm with large batch sizes decreases generalization. See https://arxiv.org/pdf/1705.08741.pdf

That paper suggests using a “ghost” batchnorm that reduces the effective batch size by applying batchnorm to “virtual” minibatches.

There’s also some empirical evidence from Myrtle.ai (who recently trained CIFAR-10 to 94% accuracy in 34 seconds) that large-batch batchnorm performed worse than ghost batchnorm with an effective batch size of 32.
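For illustration, a rough sketch of the ghost batchnorm idea (my own minimal implementation, not the paper’s or Myrtle.ai’s code): the incoming batch is split into virtual mini-batches of size `ghost_bs`, and normalization statistics are computed per virtual mini-batch.

```python
# Hedged sketch of "ghost" batch norm: normalize over virtual mini-batches
# instead of the full (large) batch.
import torch
import torch.nn as nn

class GhostBatchNorm(nn.BatchNorm2d):
    def __init__(self, num_features, ghost_bs=32, **kwargs):
        super().__init__(num_features, **kwargs)
        self.ghost_bs = ghost_bs

    def forward(self, x):
        # Split along the batch dimension and normalize each chunk separately.
        chunks = x.split(self.ghost_bs, dim=0)
        return torch.cat([super(GhostBatchNorm, self).forward(c) for c in chunks], dim=0)
```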


I believe the correct dataset in this table is Imagenette, since woof=0 by default.


Good catch, I corrected it.


I propose we reconsider using OneCycle. RAdam and Novograd both don’t need warmup, unlike Adam. We can exploit this property and use a different learning rate policy.

I’ve used a flat LR followed by cosine annealing.
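A minimal sketch of such a flat-then-anneal schedule (my own helper names and default `flat_pct`, not necessarily the repo’s exact values): hold the LR constant for the first part of training, then cosine-anneal to zero.

```python
# Hedged sketch of a flat + cosine-annealing LR schedule.
import math

def flat_and_anneal(pct, max_lr, flat_pct=0.72):
    """pct: fraction of training completed, in [0, 1]."""
    if pct < flat_pct:
        return max_lr                               # flat phase
    anneal_pct = (pct - flat_pct) / (1 - flat_pct)  # progress through the annealing phase
    return max_lr * 0.5 * (1 + math.cos(math.pi * anneal_pct))
```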

I made a simple script that runs 5-epoch training 20 times and calculates the mean and std. So far it looks good.

Updated.

https://github.com/mgrankin/over9000

Imagenette 128 scored 0.8746 over 20 runs with Over9000. That is 1.69% higher than the LB. Imagewoof improved by 2.89%.


I think the reason why RAdam/Ranger scored worse than Adam is the LR schedule. This new LR schedule (flat and annealing) is the first thing that came to my mind, and it may not be great. There should be an LR schedule for RAdam that works as well as OneCycle does for Adam.


Great work! Looks like we have a winner.

Can I suggest running Adam with flat and anneal?

A small detail: your xresnet [1] uses channel sizes = [c_in,32,64,64], while the one in the fastai repo [2] has sizes = [c_in,32,32,64]. (IIRC the different version comes from a fastai course notebook, which I used, and then it made its way to you.)
It actually might make results a tiny bit better.

[1] https://github.com/mgrankin/over9000/blob/master/xresnet.py
[2] https://github.com/fastai/fastai/blob/master/fastai/vision/models/xresnet.py
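To make the difference explicit, here is a rough plain-PyTorch sketch of the stem being discussed (`conv_bn_relu` and `make_stem` are my own stand-ins, not the repo’s code); the only change between the two versions is the channel widths.

```python
# Hedged sketch of the xresnet stem: three 3x3 conv+BN+ReLU blocks,
# the first with stride 2.
import torch.nn as nn

def conv_bn_relu(ni, nf, stride):
    return nn.Sequential(
        nn.Conv2d(ni, nf, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(nf),
        nn.ReLU(inplace=True))

def make_stem(sizes):
    # sizes = [c_in, 32, 64, 64] in [1]; [c_in, 32, 32, 64] in [2]
    return nn.Sequential(*[conv_bn_relu(sizes[i], sizes[i + 1],
                                        stride=2 if i == 0 else 1)
                           for i in range(3)])
```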

Oh, I’ll change that to [c_in,32,32,64]. Thank you for pointing this out.

I would leave it as [c_in,32,64,64]; you’ve done all your runs with that version so far, and it seems to do better anyway. I just wanted to make sure that detail doesn’t confuse someone later on.

To get back to the “guidelines” topic, I think @grankin’s GitHub repo is a good example of what I’d like to see when someone updates the leaderboard: easy-to-run code, as well as a notebook with organized and detailed results. Obviously, future entrants won’t have to rerun the baseline.

One issue could be if a repo gets updated or deleted after entering the leaderboard. Maybe we need a snapshot fork of the repo and have that be the link. Or have entrants just add their stuff in a folder on the fastai repo.

Adam with flat_and_anneal: 84.19% (worse than one-cycle)
Over9000 with 1cycle: 86.9% (still better than baseline, but worse than with flat+anneal)

(all on Imagenette)


This is great news @grankin! Thanks for doing this work. I was wondering as well whether fit_one_cycle might be disrupting how RAdam, etc. work rather than helping.

Also, thanks for the updated training script. I’m going to try to test this triple combo (RAdam + LARS + Lookahead) now on 20 epochs.
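For anyone unfamiliar with the Lookahead part of the combo, here is a minimal sketch of how it layers on top of a base optimizer (my own simplified wrapper; the actual Over9000 repo combines RAdam + LARS + Lookahead with its own implementations):

```python
# Hedged sketch of Lookahead (Zhang et al. 2019): keep a slow copy of the
# weights and, every k fast steps, move it a fraction alpha toward the fast
# weights, then copy it back into the model.
import torch

class Lookahead:
    def __init__(self, base_optimizer, k=6, alpha=0.5):
        self.opt, self.k, self.alpha = base_optimizer, k, alpha
        self.step_count = 0
        self.slow = [p.clone().detach()
                     for group in base_optimizer.param_groups
                     for p in group["params"]]

    def step(self, closure=None):
        loss = self.opt.step(closure)
        self.step_count += 1
        if self.step_count % self.k == 0:
            fast = [p for group in self.opt.param_groups for p in group["params"]]
            for slow_p, fast_p in zip(self.slow, fast):
                slow_p += self.alpha * (fast_p.detach() - slow_p)  # slow update
                fast_p.data.copy_(slow_p)                          # sync fast weights
        return loss

    def zero_grad(self):
        self.opt.zero_grad()

# Usage: wrap any base optimizer, e.g.
# opt = Lookahead(torch.optim.Adam(model.parameters(), lr=3e-3))
```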


I reran with these parameters and updated the repo. The baseline and other results became a bit worse, but the difference is approximately the same.

I see. I uploaded my results here: https://github.com/sdoria/over9000 (readme and imagenette notebook)
But I used [c_in,32,64,64]

How do you suggest we merge everything?

If [c_in,32,64,64] gives better results, shouldn’t we use that?


It gives better results, but it contains more parameters. I’m not sure we should change the base model in fastai.


I guess the difference is small anyway, it’s just unfortunate that our results aren’t consistent. But we can keep them in separate spaces for now.


1 - I would vote we go with whatever works best, i.e. [c_in,32,64,64].
That’s the whole goal of fastai, right: what works best is what we use? The problem, of course, is that this would disrupt all the baselines for the LB, though at this point it seems the LB is off anyway due to the GPU issue.

2 - Side note, but the name Over9000 is highly confusing to me. Since Over9000 is Ranger + LARS, can we refer to it as RangerLars or Ranger2 or something? That way it’s clearer what it is.

(@grankin - this is your decision since you added LARS in :) I’d like to update my article to promote your improvement since this is achieving the best scores for 5 epochs, and I think RangerLars or Ranger2 or even RangerPlus is much clearer for people.)

I’m running 20-epoch testing now. I’m also going to run Novograd with one change per the paper (0.95, 0.98 for momentum and alpha).


I’m running 2 runs each of 80 epochs for both baseline and over9000 on Imagenette.


Excellent, thanks @Seb. I was thinking that 80 epochs is likely the best test, since for production systems we really care about where training ends up, not how fast it gets there in 5 or 10 epochs.
The negative is that 80-epoch testing gets expensive fast for people like me using paid servers. But again, it’s more likely that’s the best judge of results.