I think it is going to be included in v2 when they launch the customizable optimizer: fastai v2 roadmap
If you use PyTorch, you can try my implementation here
I tested it on Transformer XL with batch size 4400 and it performed well, though the scaling was not quite linear.
Oh thanks! I hadn’t noticed they had removed the debiasing in v3. Did you find it changed anything while training? It’s only useful at the beginning, but it could avoid ending up somewhere bad, I guess.
Also, isn’t it the quotient weight_norm/adam_norm that is clipped to 10?
The optimizers for v2 are all ready and in this notebook if someone wants a preview.
They changed the clipping in v3 to apply to the weight norm, and no longer specify what value they used (just “gamma”). They refer to the clipping as “phi”. When I removed the debiasing, it didn’t seem to have much impact. Maybe it would learn faster in the beginning.
Oh interesting. But the adam norm can be pretty small sometimes, so I wonder why they did this.
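To make the difference concrete, here is a rough sketch of the two clipping variants discussed above, written in plain Python so it runs without PyTorch. The function names, the default of 10 for both limits, and the fallback of 1.0 when a norm is zero are assumptions for illustration only; they are not taken from the paper or from any implementation linked in this thread.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))

def trust_ratio_clip_quotient(weights, update, max_ratio=10.0):
    """Clip the quotient weight_norm / adam_norm (max_ratio is an assumed value)."""
    w_norm = l2_norm(weights)
    u_norm = l2_norm(update)
    if w_norm == 0.0 or u_norm == 0.0:
        return 1.0  # assumed fallback: leave the step unscaled
    return min(w_norm / u_norm, max_ratio)

def trust_ratio_clip_weight_norm(weights, update, gamma=10.0):
    """Clip only the weight norm, then divide (gamma is an assumed value)."""
    w_norm = min(l2_norm(weights), gamma)
    u_norm = l2_norm(update)
    if w_norm == 0.0 or u_norm == 0.0:
        return 1.0  # assumed fallback: leave the step unscaled
    return w_norm / u_norm

# Toy layer: weight norm 5, Adam-style update norm 0.005.
weights = [3.0, 4.0]
update = [0.003, 0.004]

ratio_quotient = trust_ratio_clip_quotient(weights, update)
ratio_weight = trust_ratio_clip_weight_norm(weights, update)
print(ratio_quotient, ratio_weight)
```

With a small update norm, clipping the quotient bounds the ratio, while clipping only the weight norm leaves it free to grow, which is the concern raised above.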
Are the v2 optimizers backwards compatible? I’m trying to find a LAMB optimizer right now.
Not really, since we don’t rely on the PyTorch optimizer anymore.
Is this available somewhere? I am interested in using LAMB optimizer with fastai v1 for a project and it would be really helpful to have your implementation as a reference.
We used the version benjmann posted above and it worked great.
Ok thanks.
It also doesn’t seem terribly difficult to write my own version of the optimizer, so I might try that as well, since it will help me understand it better.
Never a bad idea
Hey there, this LAMB optimizer is really interesting. I have just one question: since Jeremy talks about this in lesson 10 (I think), why aren’t we replacing all the root-mean-square computations with mean-abs?
This could apply to the EWMA of the squares of the grads in Adam, and also to R1 and R2, where we could go straight to p.data.abs().mean() and p.grad.data.abs().mean()
Maybe this won’t debias well, I haven’t looked at the math…
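As a quick check on the debiasing question, here is a small plain-Python sketch. It assumes the usual Adam-style setup: an EWMA initialized at zero and corrected by dividing by 1 - beta**step. That correction comes from the zero initialization, not from what is being averaged, so it works the same for |g| as for g**2. The helper names and the toy values are made up for illustration.

```python
import math

def rms(values):
    """Root mean square of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def mean_abs(values):
    """Mean absolute value of a flat list of floats."""
    return sum(abs(v) for v in values) / len(values)

def ewma_step(avg, new_value, beta):
    """One exponentially weighted moving average update."""
    return beta * avg + (1.0 - beta) * new_value

def debias(avg, beta, step):
    """Adam-style bias correction for an EWMA that started at zero."""
    return avg / (1.0 - beta ** step)

# The two statistics differ, and mean-abs is never larger than RMS.
sample = [3.0, -4.0]
print(rms(sample), mean_abs(sample))

# EWMA of |g| for a gradient with constant magnitude 2.0.
beta = 0.9
avg = 0.0
for step in range(1, 4):
    avg = ewma_step(avg, abs(-2.0), beta)
debiased = debias(avg, beta, step)
print(avg, debiased)
```

For a constant-magnitude gradient the corrected average recovers that magnitude, so debiasing itself is not a problem. What does change is the scale of the statistic, since mean-abs and RMS only agree when all entries have the same magnitude.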