ImageNette/Woof Leaderboards - guidelines for proving new high scores?

Great work! Looks like we have a winner.

Can I suggest running Adam with flat_and_anneal?

A small detail: your xresnet [1] uses channel sizes = [c_in,32,64,64], while the one in the fastai repo [2] has sizes = [c_in,32,32,64]. (IIRC the divergent version comes from a fastai course notebook, which I used, and then it made its way to you.)
It actually might make results a tiny bit better.

[1] https://github.com/mgrankin/over9000/blob/master/xresnet.py
[2] https://github.com/fastai/fastai/blob/master/fastai/vision/models/xresnet.py
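
(For context, here is roughly how those sizes drive the stem; a minimal sketch following the fastai xresnet, with conv_layer standing in for its conv+BN+ReLU helper:)

import torch.nn as nn

def conv_layer(ni, nf, stride=1):
    # 3x3 conv + batchnorm + ReLU, as in the xresnet stem
    return nn.Sequential(
        nn.Conv2d(ni, nf, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(nf),
        nn.ReLU(inplace=True))

c_in = 3
sizes = [c_in, 32, 32, 64]  # fastai repo; over9000 currently has [c_in, 32, 64, 64]
stem = [conv_layer(sizes[i], sizes[i + 1], stride=2 if i == 0 else 1)
        for i in range(3)]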

Oh, I’ll change that to [c_in,32,32,64]. Thank you for pointing this out.

I would leave it as [c_in,32,64,64]; you’ve done all your runs with that version so far, and it seems to do better anyway. I just wanted to make sure that detail doesn’t confuse someone later on.

To be back on the “guidelines” topic, I think @grankin’s GitHub repo is a good example of what I’d like to see when someone updates the leaderboard: easy-to-run code, as well as a notebook with organized and detailed results. Obviously, future entrants won’t have to rerun the baseline.

One issue could be if a repo gets updated or deleted after entering the leaderboard. Maybe we need a snapshot fork of the repo and have that be the link. Or have entrants just add their stuff in a folder on the fastai repo.

Adam with flat_and_anneal: 84.19% (worse than one-cycle)
Over9000 with 1cycle: 86.9% (still better than baseline, but worse than with flat+anneal)

(all on Imagenette)
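
(For anyone reproducing this: flat_and_anneal holds the LR flat, then cosine-anneals it to zero. A rough sketch using the fastai v1 scheduling API, modeled on the over9000 repo’s fit_with_annealing; the 0.72 split point is an assumption from its default:)

from fastai.basic_train import Learner
from fastai.callback import annealing_cos
from fastai.callbacks.general_sched import GeneralScheduler, TrainingPhase

def fit_with_annealing(learn: Learner, num_epoch, lr, annealing_start=0.72):
    # hold lr flat for the first part of training, then cosine-anneal to 0
    n = len(learn.data.train_dl) * num_epoch
    anneal_start = int(n * annealing_start)
    phases = [TrainingPhase(anneal_start).schedule_hp('lr', lr),
              TrainingPhase(n - anneal_start).schedule_hp('lr', lr, anneal=annealing_cos)]
    learn.callbacks.append(GeneralScheduler(learn, phases))
    learn.fit(num_epoch)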

This is great news @grankin! Thanks for doing this work. I was wondering as well whether fit_one_cycle might be disrupting how RAdam etc. work rather than helping.

Also, thanks for the updated training script. I’m going to try to test this triple combo (RAdam+LARS+LookAhead) now on 20 epochs.
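
(For anyone else trying the combo, this is roughly how the over9000 repo wires the three together; module and class names per that repo, alpha/k defaults assumed:)

from lookahead import Lookahead  # from the over9000 repo
from ralamb import Ralamb        # RAdam + LARS, also from the repo

def RangerLars(params, alpha=0.5, k=6, *args, **kwargs):
    # LookAhead wrapped around Ralamb = RAdam + LARS + LookAhead
    return Lookahead(Ralamb(params, *args, **kwargs), alpha, k)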

I reran with these parameters and updated the repo. The baseline and other results got a bit worse, but the gaps are approximately the same.

I see. I uploaded my results here: https://github.com/sdoria/over9000 (readme and Imagenette notebook).
But I used [c_in,32,64,64].

How do you suggest we merge everything?

If [c_in,32,64,64] gives better results, shouldn’t we use that?

It gives better results, but it has more parameters. I’m not sure we should change the base model in fastai.

I guess the difference is small anyway; it’s just unfortunate that our results aren’t consistent. But we can keep them in separate spaces for now.

1 - I would vote we go with whatever works best, i.e. [c_in,32,64,64].
That’s the whole goal of fastai, right? What works best is what we use. The problem, of course, is that this would disrupt all the baselines for the LB, though at this point the LB seems off anyway due to the GPU issue.

2 - Side note, but the name Over9000 is highly confusing to me. Since Over9000 is Ranger+LARS, can we refer to it as RangerLars or Ranger2 or something? That way it’s clearer what it is.

(@grankin - this is your decision since you added LARS in :slight_smile: I’d like to update my article to promote your improvement since it’s achieving the best scores for 5 epochs, and I think RangerLars or Ranger2 or even RangerPlus is much clearer for people.)

I’m running 20 epoch testing now. I’m also going to run Novograd with one change per the paper (0.95 and 0.98 for momentum and alpha).
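
(In fastai terms that would be something like the sketch below; I’m assuming the repo’s Novograd takes Adam-style betas:)

from functools import partial
from novograd import Novograd  # novograd.py in the over9000 repo

# betas per the paper: 0.95 (momentum) and 0.98 (second moment);
# pass this as opt_func when building the fastai Learner
opt_func = partial(Novograd, betas=(0.95, 0.98))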

I’m doing 2 runs of 80 epochs each for both the baseline and over9000 on Imagenette.

Excellent, thanks @Seb. I was thinking that 80 epochs is likely the best test, since for production systems we really care about where accuracy ends up, not how fast it gets to 5 or 10 epochs or whatever.
The downside is that 80 epoch testing gets expensive fast for people like me using paid servers. But again, it’s probably the best judge of results.

More parameters = more FLOPs. It may work worse given the same FLOPs budget.
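
(Back-of-the-envelope for the stem, counting only the 3x3 conv weights and ignoring batchnorm:)

# a 3x3 conv from n_in to n_out channels has 9 * n_in * n_out weights
stem_weights = lambda s: sum(9 * s[i] * s[i + 1] for i in range(3))
print(stem_weights([3, 32, 32, 64]))  # 28512
print(stem_weights([3, 32, 64, 64]))  # 56160, about 2x the stem weights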

It wasn’t really my idea to combine all three improvements. Federico Andres Lois (https://twitter.com/federicolois) suggested it. He had combined LARS with RAdam before and named his hybrid Ralamb. Please give the credit in your article to him. I don’t mind changing the name if you think that will make things less confusing.

Hi @grankin - thanks for the info!
1 - You definitely get credit for proving out the flat+anneal learning rate schedule, and for coding it up! So I’ll update the article and give both you and Federico credit.

2 - I think it’s clearer to name it “RangerLars”… I had also already made RangerNovo (Ranger+Novograd), so it keeps to a similar naming scheme. Thanks for being flexible on the name change; I think it will help people understand what it is much more quickly this way.

3 - I’m testing at 20 epochs, and RangerLars has cleanly beaten the old leaderboard at least:
6 epochs (5+1 was my thinking) = 79.8, 80.4, 80.6, 80.6, 80.6, 81.6
Results: mean 80.60 / stdev 0.58
I re-ran Adam under the cosine anneal (i.e. the same exact setup as here) and got 78.60 as a quick check.

So RangerLars beats the old leaderboard and the quick Adam sanity check for Woof at 20 epochs by 2.0-2.2%.
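
(Those summary stats are just the mean and sample stdev over the six runs:)

import statistics
runs = [79.8, 80.4, 80.6, 80.6, 80.6, 81.6]
print(statistics.mean(runs))   # 80.6
print(statistics.stdev(runs))  # ~0.58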

Just realized that RAdam + LARS + LookAhead is 40% slower than Adam. Can you confirm? If so, the 5 epoch results are not very useful… I ran into the same issue with SimpleSelfAttention.

I’ve got the following averages per epoch (1 GPU):
adam = 20 seconds
novograd = 23 seconds
rangernovo = 23-24 seconds
rangerlars = 32-33 seconds

However, someone just posted on my GitHub that Ranger was keeping the slow weights on the CPU rather than the GPU, and that might be the reason for the lower performance. I had thought that fastai would move everything to the GPU, but this might explain the difference in perf.

Pablo Pernias posted:
" Are you by any chance creating your Ranger optimizer before moving your model to CUDA?

If so, what you’re experiencing is a know error on optimizers that imitialize internal state based on the model parameters, like Adagrad, and would be solved just by instantiating your optimizer after you move your model to CUDA.

Here’s a post where they mention this issue: https://discuss.pytorch.org/t/effect-of-calling-model-cuda-after-constructing-an-optimizer/15165/7

It’s not so common moving the optimizer parameters to cpu/cuda after instantiating them, so I think adding a ‘to’ method to the optimizer with that purpose, as you suggest, would not be very familiar to regular PyTorch users."
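
(In other words, the fix Pablo describes is just an ordering change; xresnet50 and Ranger here stand in for whatever model and optimizer you’re using:)

model = xresnet50(c_out=10)
model = model.cuda()              # move the model to the GPU first...
opt = Ranger(model.parameters())  # ...then build the optimizer, so any state
                                  # it initializes from the params is on the GPU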

Jonas Stepanik also wrote:
The variables in self.slow_weights are always on the CPU.
You can easily fix this by adding a .to() method to the Ranger class, like so:

def to(self, device):
    # move LookAhead's slow weights to the requested device
    # (note: compare strings with ==, not `is`)
    if device == "cuda":
        for i in range(len(self.slow_weights)):
            for j, w in enumerate(self.slow_weights[i]):
                self.slow_weights[i][j] = w.cuda()
    elif device == "cpu":
        for i in range(len(self.slow_weights)):
            for j, w in enumerate(self.slow_weights[i]):
                self.slow_weights[i][j] = w.cpu()
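
(A simpler variant would be self.slow_weights[i][j] = w.to(device), which handles any torch device without the branching.)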

80 epoch results: https://github.com/sdoria/over9000/blob/master/Imagenette_128_80_epoch-bs64.ipynb

LARS slows things down quite a bit in terms of run time, and RAdam + LARS + LookAhead doesn’t seem to give results impressive enough to warrant the extra run time.

Thanks very much @Seb for the long range testing!

I see in your results that RangerLars smashed Adam at 20 epochs, but Adam comes back for the 80 epoch win… so now the question becomes whether the 5 and even 20 epoch leaderboards have as much value as the 80, given that for a production system the ‘final’ accuracy would be the most valuable.

I’m testing Stochastic Adam now with an 80 epoch run. It has a super smooth curve but was always slower out to 20 epochs… so now I want to see whether it keeps progressing out to 80 epochs and how it does.
