ImageNette/Woof Leaderboards - guidelines for proving new high scores?

It gives better results, but it contains more parameters. I’m not sure if we should change the base model in fastai.

1 Like

I guess the difference is small anyway; it’s just unfortunate that our results aren’t consistent. But we can keep them in separate spaces for now.

1 Like

1 - I would vote we go with whatever works best re: [c_in,32,64,64].
That’s the whole goal of fastai, right? What works best is what we use. The problem, of course, is that this would disrupt all the baselines for the LB, though at this point it seems the LB is off anyway due to the GPU issue.

2 - Side note, but the name Over9000 is highly confusing to me. Since Over9000 is Ranger+LARS, can we refer to it as RangerLars or Ranger2 or something? That way it’s clearer what it is.

(@grankin - this is your decision since you added LARS in :slight_smile: I’d like to update my article to promote your improvement, as it is achieving the best scores for 5 epochs, and I think RangerLars or Ranger2 or even RangerPlus is much clearer for people.)

I’m running 20-epoch testing now. I’m also going to run Novograd with one change per the paper (0.95 and 0.98 for momentum and alpha).

1 Like

I’m running two runs of 80 epochs each for both the baseline and Over9000 on Imagenette.

1 Like

Excellent, thanks @Seb. I was thinking that 80 epochs is likely the best test, since for production systems we really care about where accuracy ends up, not how fast it gets there at 5 or 10 epochs.
The downside is that 80-epoch testing gets expensive fast for people like me using paid servers. But again, it’s likely the best judge of results.

More parameters = more FLOPS. It may work worse given the same FLOPS budget.

It wasn’t really my idea to combine all three improvements. It was Federico Andres Lois (https://twitter.com/federicolois) who suggested that. He had combined LARS with RAdam before and named his hybrid Ralamb. Please give all the credit in your article to him. I don’t mind changing the name if you think that will make things less confusing.

1 Like

Hi @grankin - thanks for the info!
1 - You definitely get credit for proving the flat+anneal learning rate innovation, and for coding it up! Thus, I’ll update the article and give both you and Federico credit.

2 - I think it’s clearer to name it “RangerLars”… I had also made RangerNovo (Ranger+Novograd) already so it keeps with a similar naming scheme. Thanks for the flex on the name change - I think it will help people understand what it is much more quickly this way.

3 - I’m testing at 20 epochs, and RangerLars has cleanly beaten the old leaderboard at least:
6 runs (5+1 was my thinking) = 79.8, 80.4, 80.6, 80.6, 80.6, 81.6
Results: mean 80.60 / stdev 0.58
I re-ran Adam under the cosine anneal (i.e. the exact same setup as here) and got 78.60 as a quick check.
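(For reference, the mean and sample standard deviation above can be reproduced with the Python standard library:)

from statistics import mean, stdev  # stdev = sample standard deviation

runs = [79.8, 80.4, 80.6, 80.6, 80.6, 81.6]  # the six 20-epoch accuracies above
print(f"mean = {mean(runs):.2f}, stdev = {stdev(runs):.2f}")  # -> mean = 80.60, stdev = 0.58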

So RangerLars beats the old leaderboard and the quick Adam sanity check for Woof at 20 epochs by 2.0-2.2%.

1 Like

Just realized that RAdam + LARS + Lookahead is 40% slower than Adam. Can you confirm? If so, the 5-epoch results are not very useful… I ran into the same issue with SimpleSelfAttention.

1 Like

I’ve got the following averages per epoch (1 GPU):
Adam = 20 seconds
Novograd = 23 seconds
RangerNovo = 23-24 seconds
RangerLars = 32-33 seconds

However, someone just posted on my GitHub that Ranger was keeping the slow weights on the CPU rather than the GPU, and that might be the reason for the lower performance. I had thought that fastai would move everything to the GPU, but this might explain the perf difference.

Pablo Pernias posted:
" Are you by any chance creating your Ranger optimizer before moving your model to CUDA?

If so, what you’re experiencing is a know error on optimizers that imitialize internal state based on the model parameters, like Adagrad, and would be solved just by instantiating your optimizer after you move your model to CUDA.

Here’s a post where they mention this issue: https://discuss.pytorch.org/t/effect-of-calling-model-cuda-after-constructing-an-optimizer/15165/7

It’s not so common moving the optimizer parameters to cpu/cuda after instantiating them, so I think adding a ‘to’ method to the optimizer with that purpose, as you suggest, would not be very familiar to regular PyTorch users."
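In other words, a minimal sketch of the safe ordering (generic PyTorch, not the actual Ranger/fastai code):

import torch
from torch import nn, optim

model = nn.Linear(10, 2)

# Problematic ordering: an optimizer like Adagrad builds its internal state
# from the CPU parameters, and a later model.cuda() recreates the parameters
# on the GPU without touching the state the optimizer already captured.
#   opt = optim.Adagrad(model.parameters())
#   model.cuda()

# Safe ordering: move the model first, then build the optimizer, so any state
# derived from the parameters lives on the same device as the parameters.
if torch.cuda.is_available():
    model = model.cuda()
opt = optim.Adagrad(model.parameters())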

3 Likes

and Jonas Stepanik wrote:
Variables self.slow_weights are always on the CPU.
You can easily fix this by adding a .to() method to the Ranger class, like so:

def to(self, device):
    # Move the Lookahead "slow" weights between devices;
    # they are created on the CPU when the optimizer is instantiated.
    if device == "cuda":
        for i in range(len(self.slow_weights)):
            for j, w in enumerate(self.slow_weights[i]):
                self.slow_weights[i][j] = w.cuda()
    elif device == "cpu":
        for i in range(len(self.slow_weights)):
            for j, w in enumerate(self.slow_weights[i]):
                self.slow_weights[i][j] = w.cpu()
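Side note: since tensors have a device-agnostic .to(), the two branches above could arguably be collapsed into one (a sketch assuming slow_weights is a list of lists of tensors, as in the snippet above):

def to(self, device):
    # Move the Lookahead slow weights to whatever device is requested ("cuda", "cpu", ...)
    self.slow_weights = [[w.to(device) for w in group] for group in self.slow_weights]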
3 Likes

80 epoch results: https://github.com/sdoria/over9000/blob/master/Imagenette_128_80_epoch-bs64.ipynb

LARS slows things down quite a bit in terms of run time, and RAdam + LARS + Lookahead doesn’t seem to have results impressive enough to warrant the extra run time.

1 Like

Thanks very much @Seb for the long range testing!

I see in your results that RangerLars smashed Adam at 20 epochs, but Adam returns for the 80-epoch win… so now the question becomes whether the 5- and even 20-epoch leaderboards have as much value as the 80-epoch one, given that for a production system the ‘final’ accuracy would be the most valuable.

I’m testing Stochastic Adam now on an 80-epoch run. This one has a super smooth curve but was always slower out to 20 epochs… so now I want to see if it continues to progress out to 80 epochs and how it does.

1 Like

Well, Stochastic Adam didn’t ultimately perform on 80 epochs either.
Tomorrow the code for AutoOpt will be released - it self-tunes the LR and momentum automatically based on its own internal oracle gradient.
So you don’t set anything; you just put it to work.
Let’s see how that does - I have high hopes for it!

2 Likes

@LessW2020 Just to give a few context details - if any of it is useful for your article, feel free to use any part of it.

The idea of folding LARS into RAdam came about because I was already working with LAMB, since the problem I work on was slow as hell to train. When I discovered LAMB a month ago I did the same thing anyone would have done: just tried it. Any second I could save per epoch in my prototyping environment would be a win either way. I had been gradually working my way up to the current state, which is 32K virtual batches. My biggest physical batch per GPU is around 256 samples, but I still run the optimizer only after having accumulated 32K worth of samples. That, plus some other optimizations, pushed me from 1 hour per epoch to under 3 minutes, which for prototyping is a big deal.

The key to that speedup is that even though LAMB is slower per iteration than Adam, for every 128 optimizer iterations Adam would run I currently run just 1 of LAMB, which is a huge win - and with minimal, if not negligible, impact on accuracy… (which is difficult to measure exactly in my domain anyway, but that is another story).
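For anyone who hasn’t used "virtual" batches: this is plain gradient accumulation - run backward on several small batches and only call the (expensive) optimizer step once per accumulation window. A rough PyTorch sketch with toy stand-ins for the real model, data, and LAMB/Ralamb optimizer:

import torch
from torch import nn

# Toy stand-ins; a real setup would use the actual model, data loader, and a LAMB/Ralamb optimizer.
model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
data = [(torch.randn(256, 10), torch.randint(0, 2, (256,))) for _ in range(256)]

accum_steps = 128                                 # 128 micro-batches of 256 samples ~ a 32K virtual batch
opt.zero_grad()
for i, (xb, yb) in enumerate(data):
    loss = loss_fn(model(xb), yb) / accum_steps   # scale so the accumulated gradients average out
    loss.backward()                               # gradients sum into .grad across micro-batches
    if (i + 1) % accum_steps == 0:
        opt.step()                                # one optimizer update per virtual batch
        opt.zero_grad()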

When you published about RAdam I got the source and tried it on the spot; I liked what I saw, but it was not as fast :slight_smile: … So I studied the paper and the code and made the modification I named Ralamb. Around the time I read someone suggesting the Lookahead optimizer, I got the code, integrated it with Ralamb, and it was goooood. The rest can be summarized as you publishing the article about Ranger, my immediate tweet, and then @grankin making a hell of a fast stride forward by testing it out on a ‘leaderboard’ task for comparisons.

So to be fair to RangerLars, my suggestion is not to use the default batch size - bump it up until you hit the sweet spot. I am doing fine with 128x, but I would expect the mileage to vary on different problems, and then you can figure out whether it is ‘fast enough’ or not.

8 Likes

The LARS + RAdam + Lookahead work you shared excites me personally for image-to-image work (as you can imagine, batch sizes get severely limited quickly). Thank you! Good points on the batch size consideration - I was thinking the same and was about to suggest it.

I’d add one more thing: what I find great about RAdam and these variants is that I can move away from using RMSProp. Vanilla Adam just wasn’t working for my current work, as it was too unstable. I expect even more enthusiastic adoption of this for GANs and other more unstable models. So I think that’s quite important to consider here.

6 Likes

Oops. That’s true.

2 Likes

So it seems we missed the whole point of LARS if it’s meant to be run on 128 GPUs…

Hi,
I’ve been following this thread and find it very interesting and useful.
I’ve noticed there’s a small discrepancy between @Redknight’s implementation and the LAMB paper.
In v1 of the paper there was a section:

3.3.1 Upper bound of trust ratio (a variant of LAMB)

Even when we use the element-wise updating based on the estimates of first and second moments of the gradients, we can still use |g| = ∥∇L(x_i, w)∥_2 in the trust ratio computation. However, due to the inaccurate information, we observe that some of the LARS trust ratios are extremely large. Since the original LARS optimizer used momentum SGD as the base optimizer, the large trust ratio does not have a significant negative impact because all the elements in a certain layer use the same learning rate. However, for the adaptive optimizers like Adam and Adagrad, different elements will have different element-wise learning rates. In this situation, a large trust ratio may lead to the divergence of a weight with a large learning rate. One practical fix is to set an upper bound of the trust ratio (e.g. setting the bound as 10). By this way, we can still successfully scale the batch size of BERT training to 16K and will not add any computational and memory overhead to the LAMB optimizer.

This section has been removed in v3.
@Redknight’s implementation clamps the weight_norm, though:
weight_norm = p.data.pow(2).sum().sqrt().clamp(0, 10)
instead of the trust_ratio.
There doesn’t seem to be any official implementation of the algorithm, but in the paper’s pseudocode they don’t clamp the trust_ratio.
So I think it might be interesting to test two alternative options (a rough sketch of both follows below):

  1. Setting a limit of 10 on the trust_ratio instead of the weight_norm (as in v1): remove the clamping of weight_norm and add trust_ratio clamping.
  2. Without any limits (as in v3): remove the clamping of weight_norm.
    Edit: I won’t be able to test anything at the moment, as I don’t have access to a GPU
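A rough sketch of where the clamp would move for the two options (variable names follow typical LAMB/Ralamb implementations; treat it as illustrative pseudocode rather than a drop-in patch):

import torch

def trust_ratio(p, update, clamp_trust=None):
    # p: parameter tensor; update: the RAdam/Adam step for that parameter (before lr is applied)
    # clamp_trust=None  -> option 2: no limits at all (as in v3 of the paper)
    # clamp_trust=10.0  -> option 1: bound the trust_ratio itself (as in v1 of the paper)
    weight_norm = p.data.norm()      # note: no .clamp(0, 10) on the weight norm here
    update_norm = update.norm()
    if weight_norm == 0 or update_norm == 0:
        return 1.0
    ratio = (weight_norm / update_norm).item()
    if clamp_trust is not None:
        ratio = min(ratio, clamp_trust)
    return ratio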
2 Likes

@Seb While the abstract says "Using LARS, we scaled Alexnet up to a batch size of 8K, and Resnet-50 to a batch size of 32K without loss in accuracy.", it is very easy to overlook the fact that increasing the batch size works both ways: it is not only about multi-GPU scaling, but also about running the (expensive) optimizer step far less often in single-GPU scenarios, which makes it very efficient compute-wise.

@oguiza +1 on testing those 2 alternatives on RangerLars

Thank you for the clarification - this seems very promising. I will take a look.

Edit to add: so if I understand correctly, we need to accumulate gradients to increase the batch size on 1 GPU? fastai has a callback that does that (https://docs.fast.ai/train.html#AccumulateScheduler), but it doesn’t play well with batch norm…
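If I’m reading the v1 docs right, usage would be roughly the following (the n_step parameter name is taken from that docs page, and `path` is a placeholder for an image dataset; the batch norm caveat above still applies, since each micro-batch computes its own BN statistics):

from functools import partial
from fastai.vision import *          # fastai v1-style import
from fastai.train import AccumulateScheduler

# Small physical batches, with gradients accumulated over 16 of them per optimizer step.
data = ImageDataBunch.from_folder(path, bs=4)     # `path` assumed to point at an image dataset
learn = cnn_learner(data, models.resnet18, metrics=accuracy,
                    callback_fns=partial(AccumulateScheduler, n_step=16))
learn.fit_one_cycle(1)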

1 Like