I think it's going to be included in v2 when they launch the customizable optimizer: see the Fastai v2 roadmap.
If you use PyTorch, you can try my implementation here.
I tested it on Transformer XL with batch size 4400 and it performed well, though not quite linear.
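For anyone who wants the gist before reading a full implementation, here is a minimal single-parameter sketch of a LAMB step as described in v1 of the paper: Adam-style moments with debiasing, then a layer-wise trust ratio clipped to 10. The `lamb_step`/`state` names are just for illustration, not from any particular implementation.

```python
import torch

def lamb_step(p, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-6, wd=0.0):
    """One LAMB update for a single parameter tensor `p` with `p.grad` set."""
    g = p.grad.data
    b1, b2 = betas
    # Adam-style exponential moving averages of the grad and squared grad
    state['m'].mul_(b1).add_(g, alpha=1 - b1)
    state['v'].mul_(b2).addcmul_(g, g, value=1 - b2)
    state['step'] += 1
    # Debiasing as in Adam (the step that v3 of the paper drops)
    m_hat = state['m'] / (1 - b1 ** state['step'])
    v_hat = state['v'] / (1 - b2 ** state['step'])
    adam_step = m_hat / (v_hat.sqrt() + eps) + wd * p.data
    # Layer-wise trust ratio: weight_norm / adam_norm, clipped to 10 (v1 clipping)
    w_norm = p.data.norm().item()
    a_norm = adam_step.norm().item()
    trust = min(w_norm / a_norm, 10.0) if w_norm > 0 and a_norm > 0 else 1.0
    p.data.add_(adam_step, alpha=-lr * trust)
```

In a real optimizer you'd loop this over parameter groups and keep `state` per parameter, as `torch.optim.Optimizer` does.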
Oh thanks! I hadn’t noticed they had removed the debiasing in v3. Did you find it changed anything during training? It’s only useful at the beginning, but it could help avoid ending up somewhere bad, I guess.
Also, isn’t it the quotient weight_norm/adam_norm that is clipped to 10?
The optimizers for v2 are all ready in this notebook, if someone wants a preview.
They changed the clipping in v3 to apply to the weight norm, and no longer specify what value they used (just “gamma”). They refer to the clipping function as “phi”. When I removed the debiasing it didn’t seem to have much impact; maybe it would learn faster at the beginning.
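The difference between the two versions can be sketched on example numbers. The identity choice for phi below is purely hypothetical, since the paper leaves phi (and its "gamma") unspecified:

```python
# Example layer norms: small weights relative to the Adam step size
w_norm, a_norm = 0.05, 2.0

# v1 of the paper: the trust ratio weight_norm/adam_norm is clipped to 10
trust_v1 = min(w_norm / a_norm, 10.0)

# v3: a function phi (with an unspecified "gamma") is applied to the weight
# norm before dividing; identity is one possible reading
def phi(x):  # hypothetical choice; the paper doesn't pin phi down
    return x

trust_v3 = phi(w_norm) / a_norm  # unclipped quotient: can blow up if a_norm is tiny
```

With a bounded phi, v3 caps the numerator rather than the whole quotient, which is why a small adam_norm can still make the ratio large.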
Oh interesting. But the adam norm can be pretty small sometimes, so I wonder why they did this.
Are the V2 optimizers backwards compatible? I’m trying to find a LAMB optimizer right now.
Not really since we don’t rely on PyTorch optimizer anymore.
Is this available somewhere? I am interested in using LAMB optimizer with fastai v1 for a project and it would be really helpful to have your implementation as a reference.
We used the version benjmann posted above and it worked great.
It also doesn’t seem terribly difficult to write my own version of the optimizer, so I might try that as well, since it will help me better understand it.
Never a bad idea
Hey there, this LAMB optimizer is really interesting. I have just one question: since Jeremy talks about this in lesson 10, I think, why aren’t we replacing all the RootMeanSquared computations with MeanAbs?
This could apply to the EWMA of the squared grads in Adam, and also to R1 and R2, where we could go straight to p.data.abs().mean() and p.grad.data.abs().mean().
Maybe this won’t debias well; I haven’t looked at the math…
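One thing worth noting: the two norms aren’t interchangeable numerically. On any tensor, mean-abs is never larger than RMS (power-mean inequality), so swapping one in only some places would shift the trust ratio. A quick toy comparison:

```python
import torch

t = torch.tensor([1.0, -2.0, 3.0, -4.0])

rms = t.pow(2).mean().sqrt()  # root-mean-square, as in LAMB's layer-wise norms
mean_abs = t.abs().mean()     # the MeanAbs alternative suggested above

# mean_abs <= rms always, so replacing RMS with MeanAbs in only the
# numerator or only the denominator would bias the weight_norm/adam_norm ratio
```

Whether the debiasing factors still make sense for a mean-abs EWMA is a separate question, as noted above.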