What about the bias correction of the exponential moving average?
Jeremy is talking about it now.
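For anyone following along, here's a minimal sketch of the bias correction being discussed (the function name and tensors are mine, not fastai's): the moving average starts at zero, so it is divided by 1 - beta**step to undo the bias toward zero in the early steps.

```python
import torch

def debiased_ema(avg, grad, beta, step):
    # Exponential moving average update (in place).
    avg.mul_(beta).add_(grad, alpha=1 - beta)
    # Bias correction: avg is initialized at zero, so early values are
    # biased toward zero; dividing by (1 - beta**step) undoes that.
    return avg / (1 - beta**step)

avg = torch.zeros(3)
grad = torch.tensor([1.0, 2.0, 3.0])
for step in range(1, 4):
    print(debiased_ema(avg, grad, beta=0.9, step=step))
    # Each corrected value is already close to grad, even at step 1.
```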
Is this why I saw opt.lr overshooting my target lr the other day?
Edit: actually, I bet it's because of the opposite… since momentum goes down, the dampening does too, so the amount added each step increases compared to the amount carried over from previous iterations.
I don't know what optimizer you were using, but if it was the default Adam, the previous conversation isn't applicable.
Noooooooooooooooooooooooooooooooooooooooo, Jeremy you are breaking it!!!
Is eps still in the wrong place? The divide will happen first.
Edit: It matches up with the formula, though. So it must be correct.
Don't listen to Jeremy: in Adam, the epsilon goes outside the square root. He just undid a fix I pushed a week ago, back when our Adam wasn't working.
Jeremy just said you can put the epsilon inside the sqrt or not, but you're saying putting it inside breaks Adam? Do you know why it breaks it?
Is the smell test to check whether everything in the denominator could go to zero, with epsilon as the safety term?
The real answer is that if you put it inside, you would need a default of 1e-10 or 1e-14, so that it matches the default of 1e-5 or 1e-7 you have in Adam with epsilon outside of the square root.
This is what Jeremy explained when he showed the square roots are different.
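A quick sketch of the two placements (the values are just illustrative, not fastai defaults): with epsilon inside the square root, you need roughly the square of the outside value to get a comparable denominator.

```python
import torch

# Worst case for the denominator: the second-moment estimate is zero,
# so epsilon alone keeps us from dividing by zero.
v = torch.zeros(3)

eps_out = 1e-7               # typical default with eps outside the sqrt
eps_in = eps_out ** 2        # 1e-14, the comparable default inside the sqrt

denom_outside = v.sqrt() + eps_out   # Adam as written in the paper
denom_inside = (v + eps_in).sqrt()   # epsilon moved inside the square root

print(denom_outside, denom_inside)   # both 1e-7: the two forms match
```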
epsilon as a hyperparam?
You said the norm is the sum of squares, but the code says mean of squares (for r1).
Yes, you can even schedule it
Norm is the mean of squares, Jeremy misspoke.
So LAMB works well for language models… should it work well for other domains as well? Are there places it wouldn't be as good as Adam?
I thought the L2 norm was the square root of the sum of squares: http://mathworld.wolfram.com/VectorNorm.html
It's for big batch size training mostly. I'm not sure the difference is going to be visible for small batch sizes, but we haven't had time to experiment a lot yet.
It is in math, but not for us in deep learning. If you have a billion parameters, the L2 norm is going to be way higher than if you have a thousand parameters, so we need to scale it to the size of the tensor.
Also, in this case, it doesn't matter since we take a quotient of two of those norms over tensors of the same size.
Edit: and PyTorch uses the mean of squares for x.norm()
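A quick check of that last point (tensor names are mine): since the two tensors in the quotient have the same number of elements, it doesn't matter whether each norm uses the sum or the mean of squares.

```python
import torch

p = torch.randn(1000)       # a parameter tensor
step = torch.randn(1000)    # the proposed update, same size as p

# Sum of squares (the textbook L2 norm) vs mean of squares: the factor
# of n is the same in numerator and denominator, so it cancels.
ratio_sum = p.pow(2).sum().sqrt() / step.pow(2).sum().sqrt()
ratio_mean = p.pow(2).mean().sqrt() / step.pow(2).mean().sqrt()

print(torch.isclose(ratio_sum, ratio_mean))  # tensor(True)
```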
Instead of using %timeit, I would recommend using the ExecuteTime nbextension (https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/execute_time/readme.html); it's much nicer and becomes part of your normal dev process of optimizing each cell of code.