Jeremy is answering this now.
That’s what Jeremy is showing. But unless you use dampening, the momentum update doesn’t have the 0.1 factor; check the PyTorch source code.
You’re not doubling it, since your old contributions are weighted by 0.9**i after i iterations, and that goes to 0 pretty quickly.
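For anyone checking, here is a minimal sketch of what torch.optim.SGD does with momentum (a paraphrase of the update rule, not the actual source; the function name is made up):

```python
import torch

def sgd_momentum_step(p, buf, lr=0.1, momentum=0.9, dampening=0.0):
    # The new gradient is scaled by (1 - dampening), which is 1.0 by default,
    # so there is no 0.1 factor unless you explicitly pass dampening=0.9.
    buf.mul_(momentum).add_(p.grad, alpha=1 - dampening)
    p.data.add_(buf, alpha=-lr)
    return buf

# With a constant gradient g and dampening=0, the buffer after i steps is
#   g * (1 + 0.9 + 0.9**2 + ... + 0.9**(i-1)),
# so a gradient from i steps ago is weighted by 0.9**i: old contributions fade
# out and the running sum stays bounded (it tends to 10*g, it doesn't keep doubling).
```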
What about the bias correction of the exponential moving average?
Jeremy is talking about it now.
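For reference, the standard Adam bias correction looks roughly like this (a sketch of the textbook formula, not the lesson’s exact code; the helper name is made up):

```python
def debias(avg, beta, step):
    # The exponential moving average starts at 0, so early values are biased
    # towards 0; dividing by (1 - beta**step) undoes that bias.
    return avg / (1 - beta**step)

# At step 1 with beta=0.9 the average is 0.1 * grad, and
# debias(0.1 * grad, 0.9, 1) = 0.1 * grad / (1 - 0.9) = grad.
```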
Is this why I saw opt.lr overshooting my target lr the other day?
Edit: actually I bet it’s because of the opposite… since momentum goes down, the dampening does too, so the amount added increases compared to the amount coming in from previous iterations.
I don’t know what optimizer you were using, but if it was the default Adam, the previous conversation isn’t applicable.
Noooooooooooooooooooooooooooooooooooooooo, Jeremy you are breaking it!!!
Is eps still in the wrong place? The divide will happen first.
Edit: It matches up with the formula, though, so it must be correct.
Don’t listen to Jeremy: in Adam, the epsilon goes outside the square root. He just undid a fix I pushed a week ago because our Adam wasn’t working.
Jeremy just said you can put the epsilon inside the sqrt or not, but you’re saying that putting it inside breaks Adam? Do you know why it breaks it?
Is the smell test to check whether everything in the denominator could go to zero, with the epsilon as the safety term?
The real answer is that if you put it inside, you would need a default of 1e-10 or 1e-14, so that it matches the usual Adam default of 1e-5 or 1e-7 that you have with the epsilon outside of the square root.
This is what Jeremy explained when he showed the square roots are different.
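To make that concrete, a small sketch of the two denominators (illustrative values only):

```python
import torch

v = torch.tensor([1e-12, 1e-4, 1.0])   # running average of squared gradients

eps_out = 1e-5
denom_out = v.sqrt() + eps_out          # epsilon outside the square root

eps_in = 1e-10                          # roughly eps_out**2
denom_in = (v + eps_in).sqrt()          # epsilon inside the square root

# For tiny v, sqrt(v + 1e-10) is about 1e-5, matching sqrt(v) + 1e-5.
# Keeping eps=1e-5 but moving it inside gives sqrt(v + 1e-5) ~ 3e-3 instead:
# a much bigger denominator, hence much smaller updates for small-gradient weights.
```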
Epsilon as a hyperparam?
You said the norm is the sum of squares, but the code says mean of squares (for r1).
Yes, you can even schedule it
The norm is the mean of squares; Jeremy misspoke.
So LAMB works well for language models… should it work well for other domains as well? Are there places it wouldn’t be as good as Adam?
I thought L2 Norm was square root of sum of squares: http://mathworld.wolfram.com/VectorNorm.html
It’s mostly for big-batch training. I’m not sure the difference is going to be visible at small batch sizes, but we haven’t had time to experiment a lot yet.
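On the norm question above, a quick illustration of the quantities being compared (definitions only, not the notebook’s code):

```python
import torch

x = torch.randn(1000)

l2_norm    = x.pow(2).sum().sqrt()    # classic L2 norm: sqrt of the sum of squares
mean_of_sq = x.pow(2).mean()          # "mean of squares", as read off the code for r1
rms        = x.pow(2).mean().sqrt()   # root-mean-square: the L2 norm divided by sqrt(n)

# In LAMB, r1 and r2 are norms of the weight tensor and of its update, which have
# the same number of elements, so a 1/n factor from using the mean cancels in the
# trust ratio r1/r2; taking the square root or not does change the ratio, though.
```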