LAMB Optimizer

For 1., anyone can have their own default. In every PyTorch optimizer, wd defaults to 0.0, so I followed that logic here.
2. is irrelevant in any case. Mathematically the L2 norm is the square root of the sum of squares, yes, but using the mean is more numerically stable since the values stay smaller that way.
3. This is the same. In PyTorch, x.add_(l, b) does x += l*b.

Thanks for replying! OK, that makes sense. I think the add function threw me off: it adds tensors, but when given a scalar and a tensor it multiplies them before adding the result to the input, which is how it's set up here.
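
For anyone else tripped up by this, here is a minimal sketch of that behaviour (the scalar-first form is the older PyTorch signature; recent versions spell it with the alpha keyword):

    import torch

    x = torch.ones(3)
    b = torch.full((3,), 2.)

    # In the notebook's (older) PyTorch, x.add_(0.5, b) computed x += 0.5 * b.
    # Recent PyTorch spells the same update with the alpha keyword:
    x.add_(b, alpha=0.5)
    print(x)  # tensor([2., 2., 2.])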

I have implemented LAMB in TensorFlow. I'd appreciate any feedback. :slight_smile:

This is the same as p.data -= lr * min(r1/r2, 10) * step:

p.data.add_(-lr * min(r1/r2, 10) * step)

I have tried with: p.data.add_(-lr * r1/(r2+eps) * step)
It worked as well on MNIST, if not better.
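
To make the two update variants concrete, here is a small self-contained sketch (step is just a stand-in for the debiased Adam step, not the notebook's exact code):

    import torch

    p    = torch.nn.Parameter(torch.randn(10))
    step = torch.randn(10)            # stand-in for the debiased Adam step
    lr, eps = 1e-3, 1e-6

    r1 = p.data.pow(2).mean().sqrt()  # weight "norm" (mean-based, as in the notebook)
    r2 = step.pow(2).mean().sqrt()    # step "norm"

    # clipped trust ratio, as in the notebook
    p.data.add_(-lr * min((r1 / r2).item(), 10.) * step)
    # eps-stabilised variant suggested above (applied here only for illustration)
    p.data.add_(-lr * (r1 / (r2 + eps)) * step)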

Yes that should be better :slight_smile:

In the LAMB paper section 3.1 they talk about handling 0s in the trust ratio, but I don’t see that in the course notebook. Is something missing?

I found that section quite hard to parse, but another implementation I came across sets r=1 if r1 or r2 is 0.
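
A rough sketch of that convention (my reading of that other implementation, not the course notebook's code): fall back to a trust ratio of 1 whenever either norm is zero, e.g. for freshly initialised weights or a zero step.

    def trust_ratio(r1, r2, clip=10.):
        # if either norm is 0, skip the layer-wise scaling entirely
        if r1 == 0. or r2 == 0.:
            return 1.
        return min(r1 / r2, clip)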

After much hacking I got StatefulOptimizer and lamb_func working with fastai v1. It’s really ugly right now but the results look promising.

I spent all day yesterday tweaking hyper-parameters on my language model, and on the first run with LAMB I’m seeing a ~0.3% improvement in accuracy and validation loss over my best results from yesterday (on 3 epochs, frozen).

Unfortunately it looks like I have a bug somewhere (probably in my param_groups hacks), because once I unfroze, the losses and accuracy stopped improving much, whereas with fastai v1's Adam they kept getting better.

The main things I had to do to get it working with fastai were:

  • Modified OptimWrapper to not wrap these new-style optimizers (I did this in a hacky way by detecting the presence of hypers on the optimizer)
  • Modified the constructor of Optimizer to account for OptimWrapper passing parameters in a different format to its initializer
  • Changed grad_params to not error out due to the different format of the parameters
  • Hacked into Optimizer.__setattr__ to propagate OneCycle's updates to the lr and mom properties through to hypers so they get passed to the stepper functions (a rough sketch of the idea follows this list)
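
Purely as an illustration of that last point (the names here are made up for the sketch and are not fastai's actual API), the idea was roughly:

    # Hypothetical sketch: propagate assignments to lr/mom into a `hypers`
    # list of dicts, so a scheduler that sets optimizer.lr still reaches the
    # per-group hyper-parameters used by the stepper functions.
    class HyperPropagation:
        def __init__(self, hypers):
            object.__setattr__(self, 'hypers', hypers)  # bypass our own __setattr__

        def __setattr__(self, name, value):
            if name in ('lr', 'mom'):
                for hyper in self.hypers:
                    hyper[name] = value
            object.__setattr__(self, name, value)

    opt = HyperPropagation([{'lr': 1e-3, 'mom': 0.9}, {'lr': 1e-3, 'mom': 0.9}])
    opt.lr = 3e-3         # OneCycle-style update
    print(opt.hypers)     # both groups now have lr=3e-3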

The results are promising so I’m going to keep at it. I’m excited to see how things will look once I’m running bigger batch sizes on multiple GPUs since that’s where it’s really supposed to shine.

Edit: wd=0 worked even better than wd=0.01; picked up another 0.3% accuracy on the 3 frozen epochs (although with a bit higher validation loss).

I’m curious if LAMB made it into the fastai library. Did you submit a PR with your solution?

I think it was going to be included in v2 when they launch the customizable optimizer: Fastai v2 roadmap

If you use PyTorch, you can try my implementation here

I tested it on Transformer-XL with batch size 4400 and it performed well, though the scaling was not quite linear.

Oh thanks! I hadn’t noticed they had removed the debiasing in v3. Did you find it changed something while training? It’s only useful at the beginning, but it could avoid ending up somewhere bad, I guess.
Also, isn’t it the quotient weight_norm/adam_norm that is clipped to 10?

The optimizers for v2 are already in this notebook if someone wants a preview.

They changed the clipping in v3 to apply to the weight norm, and no longer specify what value they used (just “gamma”). They refer to the clipping function as “phi”. When I removed the debiasing it didn’t seem to have much impact. Maybe it would learn faster in the beginning.
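
To spell out the difference (illustrative only; a simple min stands in for their unspecified phi/gamma):

    def trust_ratio_v1(weight_norm, adam_norm, clip=10.):
        return min(weight_norm / adam_norm, clip)   # v1: clip the ratio itself

    def trust_ratio_v3(weight_norm, adam_norm, clip=10.):
        return min(weight_norm, clip) / adam_norm   # v3: clipping applied to the weight norm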

Oh interesting. But the adam norm can be pretty small sometimes, so I wonder why they did this.

Are the V2 optimizers backwards compatible? I’m trying to find a LAMB optimizer right now.

Not really, since we don’t rely on PyTorch optimizers anymore.

Is this available somewhere? I am interested in using LAMB optimizer with fastai v1 for a project and it would be really helpful to have your implementation as a reference.

We used the version benjmann posted above and it worked great.

OK, thanks.

It also seems that it isn’t terribly difficult to write my own version of the optimizer, so I might try that as well since it will help me better understand it.

Never a bad idea :slight_smile:

Hey there, this LAMB optimizer is really interesting; I have just one question. Since Jeremy talks about this in lesson 10 (I think): why aren’t we replacing all the RootMeanSquared computations with MeanAbs?
This could apply to the EWMA of the squared gradients in Adam, and also to r1 and r2, where we could go straight to p.data.abs().mean() and p.grad.data.abs().mean().
Maybe this won’t debias well, I haven’t looked at the math…
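
For concreteness, this is the substitution I mean (just illustrating the two statistics, not claiming the MeanAbs version debiases correctly):

    import torch

    p = torch.randn(1000)

    r1_rms  = p.pow(2).mean().sqrt()   # root-mean-square, as in the notebook's r1
    r1_mabs = p.abs().mean()           # proposed MeanAbs variant
    print(r1_rms, r1_mabs)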