What about the bias correction of the exponential moving average?

Jeremy is talking about it now.
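For anyone catching up on what "bias correction" means here, a minimal sketch in plain Python (not the lesson's code; `debiased_ema` and its parameter names are made up for illustration). An exponential moving average that starts at zero is biased toward zero for the first few steps, and dividing by `1 - beta**step` undoes that:

```python
def debiased_ema(values, beta=0.9):
    """Return (raw, debiased) exponential moving averages of `values`.

    Hypothetical helper for illustration only.
    """
    avg = 0.0
    raw, debiased = [], []
    for step, value in enumerate(values, start=1):
        # Standard dampened update: keep beta of the old average,
        # add (1 - beta) of the new value.
        avg = beta * avg + (1 - beta) * value
        raw.append(avg)
        # The average started at 0, so after `step` updates it only
        # carries a total weight of (1 - beta**step). Divide it out.
        debiased.append(avg / (1 - beta ** step))
    return raw, debiased


# With a constant input of 1.0 the raw average starts far too low,
# while the debiased one stays at about 1.0 from the first step.
raw, debiased = debiased_ema([1.0] * 5, beta=0.9)
```

The correction matters most early in training; as `step` grows, `beta**step` goes to zero and the two averages coincide.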

Is this why I saw opt.lr overshooting my target lr the other day?

Edit: actually I bet it's because of the opposite… since momentum goes down, the dampening does too, so the amount added increases compared to the amount coming in from previous iterations.

I don't know what optimizer you were using, but if it was the default Adam, the previous conversation isn't applicable.

Noooooooooooooooooooooooooooooooooooooooo, Jeremy you are breaking it!!!

Is `eps` still in the wrong place? The divide will happen first.

Edit: It matches up with the formula, though. So it must be correct.

Don't listen to Jeremy, in Adam, the epsilon goes outside the square root. He just undid a fix I pushed a week ago where our Adam **wasn't** working.

Jeremy just said you can put the epsilon inside the sqrt or not, but you're saying that putting it inside breaks Adam? Do you know why it breaks it?

Is the smell test to check whether everything in the denominator could go to zero, with epsilon as the safety term?

The real answer is that if you put it inside the square root, you would need a default of 1e-10 or 1e-14, so that it matches the Adam default of 1e-5 or 1e-7 that you have with epsilon outside the square root.

This is what Jeremy explained when he showed the square roots are different.
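A quick way to see the scale difference, as a plain-Python sketch (the function names are made up here, and this is not the actual optimizer code): when the squared-gradient average `v` is close to zero, the denominator is roughly `eps` with epsilon outside the square root but roughly `sqrt(eps)` with epsilon inside.

```python
import math


def denom_eps_outside(v, eps):
    # Epsilon added after the square root: sqrt(v) + eps
    return math.sqrt(v) + eps


def denom_eps_inside(v, eps):
    # Epsilon added before the square root: sqrt(v + eps)
    return math.sqrt(v + eps)


v = 0.0  # worst case: the squared-gradient average has gone to zero

outside = denom_eps_outside(v, 1e-5)         # floor of the denominator is eps
inside_same_eps = denom_eps_inside(v, 1e-5)  # floor is sqrt(eps), much larger
inside_squared = denom_eps_inside(v, 1e-10)  # eps squared gives the same floor

print(outside, inside_same_eps, inside_squared)
```

So the same numeric `eps` gives a much larger floor on the denominator when it sits inside the square root, which means much smaller steps for parameters with tiny gradients. Squaring the default is what makes the two placements comparable.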

Epsilon as a hyperparameter?

you said norm is sum of squares but the code says mean of squares (for r1)

Yes, you can even schedule it

Norm is the mean of squares, Jeremy misspoke.

So LAMB works well for language models… should it work well for other domains as well? Are there places it wouldn't be as good as Adam?

I thought L2 Norm was square root of sum of squares: http://mathworld.wolfram.com/VectorNorm.html

It's mostly for large-batch training. I'm not sure the difference is going to be visible for small batch sizes, but we haven't had time to experiment much yet.

It is in math, but not for us in deep learning. If you have a billion parameters, the L2 norm is going to be way higher than if you have a thousand parameters, so we need to scale it up to the size.

Also, in this case, it doesn't matter since we take a quotient of two of those norms with tensors of the same size.

Edit: note that PyTorch's `x.norm()` returns the square root of the sum of squares, so the mean has to be taken explicitly, e.g. `x.pow(2).mean().sqrt()`.
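To make the two definitions concrete, here is a small plain-Python sketch (lists instead of tensors; `l2_norm`, `rms`, and the sample values are made up for illustration). It shows that the textbook L2 norm grows with the number of parameters while the root mean square does not, and that the quotient of two same-size vectors is the same under either definition.

```python
import math


def l2_norm(xs):
    # Textbook L2 norm: square root of the SUM of squares.
    return math.sqrt(sum(x * x for x in xs))


def rms(xs):
    # Size-independent variant: square root of the MEAN of squares.
    return math.sqrt(sum(x * x for x in xs) / len(xs))


# The L2 norm scales with sqrt(number of elements), the RMS does not.
small, large = [1.0] * 1_000, [1.0] * 1_000_000
print(l2_norm(small), l2_norm(large))  # grows with size
print(rms(small), rms(large))          # stays the same

# For two vectors of the same length the sqrt(n) factor cancels,
# so the quotient is the same whichever definition is used.
weights = [0.5, -1.0, 2.0, 0.25]
update = [0.1, 0.2, -0.4, 0.05]
ratio_l2 = l2_norm(weights) / l2_norm(update)
ratio_rms = rms(weights) / rms(update)
print(ratio_l2, ratio_rms)
```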

Instead of using %timeit, I would recommend the execute_time nbextension: https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/execute_time/readme.html. It's much nicer and becomes part of your normal dev process of optimizing each cell of code.