ImageNette/Woof Leaderboards - guidelines for proving new high scores?

Well, Stochastic Adam didn’t ultimately perform on the 80 epoch run either.
Tomorrow the code for AutoOpt will be released - it self-tunes LR and momentum automatically based on its own internal oracle gradient.
So you don’t set anything, just put it to work.
Let’s see how that does, I have high hopes for it!

2 Likes

@LessW2020 Just to give a few context details - if any of it is useful for your article, feel free to use any part of it.

The idea of including LARS into RAdam came because I was already working with LAMB, since the problem I work on was slow as hell to train. When I discovered LAMB a month ago I did what anyone would have done: I just tried it. Any second I could save per epoch on my prototyping environment would be a win either way. I have gradually worked my way up to the current state, which is 32K virtual batches. My biggest physical batch per GPU is around 256 samples, but I still run the optimizer only after having accumulated 32K worth of samples. That, with some other optimizations, pushed me from 1 hour per epoch to under 3 minutes, which for prototyping is a big deal.

The key to that speedup is that even though a LAMB iteration is actually slower than an Adam iteration, for every 128 iterations of Adam I am currently running just 1 of LAMB, which is a huge win, and with minimal if not negligible impact on accuracy (which is difficult to measure exactly in my domain anyway, but that is another story).
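
In case it helps to picture it, this is roughly what the accumulation loop looks like in plain PyTorch (an illustrative sketch, not my actual training code; the names and the 128 steps are just for the example):

```python
import torch

# Illustrative sketch only (not the actual training loop): 128 physical
# batches of 256 samples are accumulated into one 32K "virtual" batch,
# and the expensive LAMB-style optimizer step runs once per 128 iterations.
ACCUM_STEPS = 128

def train_one_epoch(model, loader, optimizer, loss_fn, device="cuda"):
    model.train()
    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        # Scale the loss so the accumulated gradient is the average over the virtual batch.
        loss = loss_fn(model(x), y) / ACCUM_STEPS
        loss.backward()                      # gradients keep accumulating in .grad
        if (i + 1) % ACCUM_STEPS == 0:
            optimizer.step()                 # one optimizer step per 128 backward passes
            optimizer.zero_grad()
```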

When you published about RAdam I got the source and tried it on the spot. I liked what I saw, but it was not as fast :slight_smile: … So I studied the paper and the code and made the modification I named Ralamb. Around that time I read that someone had suggested the Lookahead optimizer, so I got the code, integrated it with Ralamb, and it was goooood. The rest can be summarized as: you published the article about Ranger, I immediately tweeted, and then @grankin made a hell of a fast stride forward by testing it out on a ‘leaderboard’ task for comparison.

So to be fair with RangerLars, my suggestion is to not use the default batch size; bump it up until you hit the sweet spot. I am doing fine with 128x, but I would expect the mileage to vary on different problems, and only then figure out whether it is ‘fast enough’ or not.

8 Likes

The LARS + RAdam + Lookahead work you shared excites me personally for image-to-image work (as you can imagine, batch sizes get severely limited quickly). Thank you! Good points on the batch size consideration - I was about to suggest the same.

I’d add one more thing: what I find great about RAdam and these variants is that I can move away from using RMSProp. Vanilla Adam just wasn’t working for my current work, as it was too unstable. I expect even more enthusiastic adoption of this for GANs and other less stable models. So I think that’s quite important to consider here.

6 Likes

Oops. That’s true.

2 Likes

So it seems we missed the whole point of LARS if it’s meant to be run on 128 GPUs…

Hi,
I’ve been following this thread and find it very interesting and useful.
I’ve noticed there’s a small discrepancy between @Redknight’s implementation and the LAMB paper.
In v1 of the paper there was a section:

3.3.1 Upper bound of trust ratio (a variant of LAMB)

Even when we use the element-wise updating based on the estimates of first and second moments of the gradients, we can still use |g| = ∥∇L(xi,w)∥2 in the trust ratio computation. However, due to the inaccurate information, we observe that some of the LARS trust ratios are extremely large. Since the original LARS optimizer used momentum SGD as the base optimizer, the large trust ratio does not have a significant negative impact because all the elements in a certain layer use the same learning rate. However, for the adaptive optimizers like Adam and Adagrad, different elements will have different element-wise learning rates. In this situation, a large trust ratio may lead to the divergence of a weight with a large learning rate. One practical fix is to set an upper bound of the trust ratio (e.g. setting the bound as 10). By this way, we can still successfully scale the batch size of BERT training to 16K and will not add any computational and memory overhead to the LAMB optimizer.

This section has been removed in v3.
@Redknight’s implementation, however, clamps the weight_norm instead of the trust_ratio:
weight_norm = p.data.pow(2).sum().sqrt().clamp(0, 10)
There doesn’t seem to be any official implementation of the algo, but in the pseudocode, they don’t clamp the trust_ratio.
So I think it might be interesting to test 2 alternative options (a rough sketch of both is below):

  1. Setting a limit of 10 on the trust_ratio instead of the weight_norm (as in v1): remove the weight_norm clamping and add trust_ratio clamping.
  2. Not setting any limit (as in v3): just remove the weight_norm clamping.

Edit: I won’t be able to test anything at the moment, as I don’t have access to a GPU.
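
To make the two options concrete, here is a rough sketch of what the trust ratio computation could look like in each case (variable names loosely follow @Redknight’s gist; this is just an illustration, not the actual Ralamb code):

```python
import torch

# Rough illustration of the two options above; variable names loosely follow
# @Redknight's gist, but this is not the actual Ralamb code.
def compute_trust_ratio(p, update, clamp_value=None):
    # clamp_value=10.0 -> option 1: clamp the trust ratio itself (as in v1 of the paper)
    # clamp_value=None -> option 2: no clamping at all (as in v3 of the paper)
    weight_norm = p.data.norm(2)          # note: no .clamp(0, 10) on the weight norm here
    update_norm = update.norm(2)
    if weight_norm == 0 or update_norm == 0:
        return torch.tensor(1.0)          # neutral ratio for degenerate cases
    trust_ratio = weight_norm / update_norm
    if clamp_value is not None:
        trust_ratio = trust_ratio.clamp(max=clamp_value)
    return trust_ratio
```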
2 Likes

@Seb While the abstract says “Using LARS, we scaled Alexnet up to a batch size of 8K, and Resnet-50 to a batch size of 32K without loss in accuracy.”, it is very easy to overlook the fact that increasing the batch size works both ways: it is not only about multi-GPU scaling, but also lets you skip most of the optimizer steps in single-GPU scenarios, which makes it very efficient compute-wise.

@oguiza +1 on testing those 2 alternatives on RangerLars

Thank you for the clarification; it seems very promising. I will take a look.

Edit to add: so if I understand correctly, we need to accumulate gradients to increase the effective batch size on 1 GPU? Fastai has a callback that does that (https://docs.fast.ai/train.html#AccumulateScheduler) but it doesn’t play well with batch norm…
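
If I understand the docs right, hooking it up would look roughly like this (a sketch assuming the fastai v1 API from the link above; the n_step argument is written from memory, so double-check it):

```python
from functools import partial
from fastai.vision import *              # fastai v1
from fastai.train import AccumulateScheduler

# Sketch only, assuming the fastai v1 API from the docs linked above.
# The n_step argument (number of batches to accumulate before an optimizer
# step) is written from memory - treat it as an assumption and check the docs.
path = untar_data(URLs.IMAGENETTE_160)
data = ImageDataBunch.from_folder(path, train='train', valid='val',
                                  ds_tfms=get_transforms(), size=128, bs=32)
learn = cnn_learner(data, models.resnet50, metrics=accuracy,
                    callback_fns=[partial(AccumulateScheduler, n_step=16)])
learn.fit_one_cycle(5)                    # effective batch size ~ 32 * 16 = 512
```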

1 Like

Novograd doesn’t beat Adam on 20 epochs so far: https://github.com/sdoria/over9000/blob/master/Imagenette_128_20_epoch.ipynb

I am doing the training loop manually for that very same reason.

Can you expand on your solution for batch norm if you have one…? That’s the one thing you don’t get an advantage on with gradient accumulation as far as I’ve seen because batch norm stats are still calculated per real batch.

(Federico’s response)

3 Likes

If anyone wants to play around with a model that doesn’t have batch norm, there’s fixup-resnet from here (https://github.com/hongyi-zhang/Fixup/blob/master/imagenet/models/fixup_resnet_imagenet.py)

I ran it for 80 epochs on Imagenette 128px with Adam (lr 3e-3), and it does train if you pick your LR carefully. I got ~90% in 80 epochs (actually 91.6% on epoch 63). Nothing too exciting compared to 95% with xresnet50, but epochs do run ~30% faster.

My goal was to see how AccumulateScheduler worked without batch norm layers, but I failed to get anything worth sharing. Maybe someone else will get better luck.

1 Like

Hi Federico,
Great to meet you and glad to see you here!
This info is really helpful for understanding how you came up with putting LARS in - great work on the idea, and great work by @grankin on the fast coding and testing!
I appreciate the feedback regarding bumping up the batch size as well.
For reference, I spent all day testing out a new activation function, Mish, including testing it with RangerLars. The combo of RangerLars + Mish performs extremely well.
I just posted an article about Mish but didn’t reference RangerLars yet, so we can first figure out the computational tradeoffs, batch size, etc.
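For anyone curious, Mish is just x * tanh(softplus(x)); a minimal PyTorch version looks something like this sketch (illustrative, not the exact implementation I benchmarked):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))
```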
Anyway, thanks for the insights, and I hope you’ll stick around on the boards - it’s great to have you here!
Best regards,
Less

3 Likes

Thanks very much @oguiza for this finding and idea re: testing! If no one beats me to it, I will try and test it out by this weekend and post results.

1 Like

Just to let you know that I have updated the Ralamb code because it had a defect in the step calculation, which was spotted by Yaroslav Geraskin.

Updated code with details on the changes in the comments. https://gist.github.com/redknightlois/c4023d393eb8f92bb44b2ab582d7ec20

EDIT: Another update, which also applies @oguiza’s suggestion of clamping the trust_ratio instead of the norm; it looks good on my dataset.

5 Likes

Fantastic, thanks for integrating both of these improvements!

Hey there,
I had a few issues with the latest update of the RAdam implementation following the changes mentioned in the comments of the gist. So I went ahead, re-read both papers, and implemented it myself, and it’s working quite well: https://github.com/frgfm/Holocron/blob/master/holocron/optim/radam.py

While implementing it, I noticed a few things that may explain the differences in performance across implementations:

  • Update norm in LARS: in the paper, the norm of the update (= grad_term + wd * param) is computed as norm(grad_term) + wd * norm(param). I’ll investigate how that differs from taking the overall norm(grad_term + wd * param), since by the triangle inequality the latter is less than or equal to the former, and strictly smaller unless the two terms are colinear (see the short sketch after this list).

  • LARS clipping: I made this an optional argument of the optimizer.

Last point: I ditched the buffer used in the first implementation, as I didn’t see any purpose for it.
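
To illustrate the first point, here is a tiny sketch comparing the two ways of computing the update norm (just a demonstration of the triangle inequality, not code from the repo):

```python
import torch

# Compare norm(grad_term + wd * param) with norm(grad_term) + wd * norm(param).
# By the triangle inequality the combined norm is <= the sum of the norms,
# with equality only when the two terms point in the same direction.
torch.manual_seed(0)
param, grad_term, wd = torch.randn(1000), torch.randn(1000), 1e-2

combined = (grad_term + wd * param).norm(2)           # norm of the full update
separate = grad_term.norm(2) + wd * param.norm(2)     # sum of norms (as in the paper)

print(f"norm(grad + wd*param)       = {combined.item():.4f}")
print(f"norm(grad) + wd*norm(param) = {separate.item():.4f}")   # always >= the combined norm
```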

5 Likes

For Imagenette/Imagewoof, wouldn’t it save everyone’s bandwidth (including fastai’s) if fastai provided datasets resized to 256, 192, and 128 px, just like in the leaderboard, instead of 320 and 160?

I ask this because we’ve found that you get much better results using the full-sized images and resizing directly to 128 px instead of using the intermediate 160 px dataset. So people might start using the full-sized dataset a lot more.

I want to thank Jeremy for providing those datasets. I hope it’s not too costly; I have downloaded them a bunch of times because I switch machines all the time, and out of (major) convenience.

1 Like

No, you generally don’t want to use images the same size as the size you’re targeting. Remember, you’re doing random crops etc., so it’s best to have bigger images than your target size.

3 Likes