Lesson 11 discussion and wiki

What about the bias correction of the exponential moving average?

Jeremy is talking about it now.
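
For anyone following along, here is a minimal sketch of what bias correction does to a dampened exponential moving average. It is plain Python rather than the notebook's optimizer code, and the function name `ema_with_debias` and the value `beta=0.9` are assumptions picked for illustration.

```python
def ema_with_debias(values, beta=0.9):
    """Return (raw, debiased) exponential moving averages of `values`."""
    avg = 0.0
    raw, debiased = [], []
    for step, value in enumerate(values, start=1):
        # Dampened update: the incoming value is weighted by (1 - beta).
        avg = beta * avg + (1 - beta) * value
        raw.append(avg)
        # Dividing by (1 - beta**step) undoes the pull toward the zero start.
        debiased.append(avg / (1 - beta ** step))
    return raw, debiased


raw, debiased = ema_with_debias([1.0] * 5)
print(raw)
print(debiased)
```

With a constant input of 1.0, the raw average starts near 0.1 and creeps upward, while the debiased one sits at roughly 1.0 from the first step.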

Is this why I saw opt.lr overshooting my target lr the other day?

Edit: actually I bet it’s because of the opposite… since momentum goes down, the dampening does too, so the amount added increases compared to the amount coming in from previous iterations :thinking:

I don’t know what optimizer you were using, but if it was the default Adam, the previous conversation isn’t applicable.

Noooooooooooooooooooooooooooooooooooooooo, Jeremy you are breaking it!!!


Is eps still in the wrong place? The divide will happen first.

Edit: It matches up with the formula, though. So it must be correct.


Don’t listen to Jeremy, in Adam, the epsilon goes outside the square root. He just undid a fix I pushed a week ago where our Adam wasn’t working.


Jeremy just said you can put the epsilon inside the sqrt or not, but you’re saying inside it’s breaking Adam? Do you know why it is breaking it?


Is the smell test to check whether everything in the denominator could go to zero, with the epsilon as the safety term?

The real answer is that if you put it inside, you would need a default of 1e-10 or 1e-14, so that it matches the default of 1e-5 or 1e-7 you have in Adam with epsilon outside of the square root.
This is what Jeremy explained when he showed the square roots are different.
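
The arithmetic behind those defaults can be checked with a small sketch. It only looks at the worst case where the squared-gradient average is zero, and the helper names `denom_eps_outside` and `denom_eps_inside` are made up for this example, not taken from any library.

```python
import math


def denom_eps_outside(sq_avg, eps):
    # Adam-style denominator with epsilon added after the square root.
    return math.sqrt(sq_avg) + eps


def denom_eps_inside(sq_avg, eps):
    # Variant with epsilon added before taking the square root.
    return math.sqrt(sq_avg + eps)


# Worst case: the squared-gradient average has collapsed to zero.
floor_outside = denom_eps_outside(0.0, eps=1e-5)
floor_inside_same_eps = denom_eps_inside(0.0, eps=1e-5)
floor_inside_squared_eps = denom_eps_inside(0.0, eps=1e-10)

print(floor_outside, floor_inside_same_eps, floor_inside_squared_eps)
```

With the same epsilon, the inside version bottoms out at sqrt(1e-5), a few hundred times larger than 1e-5, so the effective step is much smaller. Squaring the epsilon (1e-5 becomes 1e-10, 1e-7 becomes 1e-14) brings the two floors back in line. The match is only exact at zero; for larger averages the two forms still differ slightly.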


Epsilon as a hyperparameter?

You said the norm is the sum of squares, but the code says mean of squares (for r1).

Yes, you can even schedule it :wink:


Norm is the mean of squares, Jeremy misspoke.

So LAMB works well for language models… should it work well for other domains as well? Are there places it wouldn’t be as good as Adam?

I thought L2 Norm was square root of sum of squares: http://mathworld.wolfram.com/VectorNorm.html

It’s for big batch size training mostly. I’m not sure the difference is going to be visible for small batch sizes, but we didn’t have time to experiment a lot yet.


It is in math, but not for us in deep learning. If you have a billion parameters, the L2 norm is going to be way higher than if you have a thousand parameters, so we need to scale it by the number of parameters.
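
A quick sketch of the size effect, in plain Python. The helpers `l2_norm` and `rms` are written out here for illustration, and a million elements stand in for the billion so the example runs quickly.

```python
import math


def l2_norm(xs):
    # Textbook L2 norm: square root of the sum of squares.
    return math.sqrt(sum(x * x for x in xs))


def rms(xs):
    # Size-normalised version: square root of the mean of squares.
    return math.sqrt(sum(x * x for x in xs) / len(xs))


small = [0.5] * 1_000
large = [0.5] * 1_000_000

print(l2_norm(small), l2_norm(large))  # grows with the number of elements
print(rms(small), rms(large))          # unchanged by the number of elements
```

Every element has the same magnitude in both lists, yet the L2 norm of the larger one is much bigger, while the mean-based version stays put.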


Also, in this case, it doesn’t matter since we make a quotient of two of those norms with tensors the same size.
Edit: note that PyTorch’s x.norm() is the square root of the sum of squares, not of the mean; the extra factor of sqrt(n) cancels in the quotient anyway.
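
To see why the choice does not matter for a quotient of two same-size tensors, here is a small plain-Python check. The lists `weights` and `step` are made-up values standing in for a parameter tensor and its update, and the helper names are assumptions for this example.

```python
import math


def l2_norm(xs):
    # Square root of the sum of squares.
    return math.sqrt(sum(x * x for x in xs))


def rms(xs):
    # Square root of the mean of squares.
    return math.sqrt(sum(x * x for x in xs) / len(xs))


# Hypothetical parameter values and update, same length on purpose.
weights = [0.3, -1.2, 0.8, 2.0]
step = [0.01, 0.02, -0.03, 0.005]

ratio_l2 = l2_norm(weights) / l2_norm(step)
ratio_rms = rms(weights) / rms(step)

print(ratio_l2, ratio_rms)
```

Both norms differ from each other by a factor of sqrt(n), and since n is the same for numerator and denominator, that factor cancels and the two ratios agree up to floating-point rounding.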

Instead of using %timeit I would recommend using the execute_time notebook extension (https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/execute_time/readme.html): it’s much nicer and becomes part of your normal dev process of optimizing each cell of code.