Could we learn the parameters used in the step function, as in a meta-learning scenario?
Yes, of course they use heavy machinery, but being able to scale to a 64k batch size is quite an achievement.
Not really, as they don't appear in the loss, so we don't have gradients for them.
Why not define `self.grad_params = ...` during `__init__()` once, when the Optimizer is first constructed?
It's exactly the same to define it as a computed property, I think.
I think Jeremy meant to say `.add_` is the in-place version of `.add` in PyTorch.
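For anyone unfamiliar with the naming convention: in PyTorch the trailing underscore marks the in-place variant. A tiny illustration with made-up tensors:

```
import torch

p = torch.zeros(3)
g = torch.ones(3)

q = p.add(g)  # out-of-place: returns a new tensor, p is unchanged
p.add_(g)     # in-place: modifies p directly, as an optimizer step does
```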
Really? It seems like he is rebuilding the list of `grad_params` every time he wants to refer to them.
The reason why you might generally want to do that is in case, for whatever reason, `self.param_groups` or `self.hypers` changes after calling `__init__`. Properties (or in this case, a zero-argument method) keep attributes that should stay consistent with each other in sync.
Yes, but it doesn't take much time to do so. Even for a super deep model, you may have around 300 parameters (counting the separate tensors).
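A minimal sketch of the computed-property option being discussed (a toy `Optimizer`, not the course's actual class):

```
import torch

class Optimizer:
    def __init__(self, param_groups):
        self.param_groups = param_groups  # list of lists of tensors

    # Rebuilt on every access, so it always reflects the current state
    # of self.param_groups; cheap even for a few hundred tensors.
    @property
    def grad_params(self):
        return [p for pg in self.param_groups
                for p in pg if p.requires_grad]

params = [torch.randn(2, 2, requires_grad=True)]
opt = Optimizer([params])
print(len(opt.grad_params))  # 1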
Shouldn't the momentum be 0.9*prev + 0.1*new_grad? The equation looks like 0.9*prev + new_grad.
And we made that mistake a lot of times, but basic momentum doesn't have the 0.1.
The journey to deep knowledge seems to be pretty hard.
One more reason to have a critical mind and ask questions.
Pretty hard to explain to the people who evaluate and approve models before moving to production.
OK, but then you don't keep the gradient normalized - wouldn't it keep growing?
This emphasizes the importance of getting the foundations right.
But wouldn't that make the momentum grow out of control, since we are essentially doubling it every time? 0.9 * old_mom + 1.0 * new_grad, as opposed to 0.9 * old_mom + 0.1 * new_grad.
Jeremy is answering this now.
That's what Jeremy is showing. But unless you have dampening, momentum doesn't have the 0.1. Check the PyTorch source code.
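For reference, roughly what the momentum buffer update in `torch.optim.SGD` does (paraphrased, not the verbatim source):

```
import torch

momentum, dampening = 0.9, 0.0  # PyTorch defaults dampening to 0
buf = torch.zeros(3)            # momentum buffer
grad = torch.ones(3)

# buf = momentum * buf + (1 - dampening) * grad
buf.mul_(momentum).add_(grad, alpha=1 - dampening)
# With dampening=0 this is 0.9*buf + 1.0*grad, i.e. no 0.1 on the gradient;
# setting dampening=0.1 would give the 0.9/0.1 form asked about above.
```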
You're not doubling it, since your old contributions are weighted by 0.9**i after i iterations, and that gets to 0 pretty quickly.
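To see why it stays bounded, a quick sketch with a made-up constant gradient of 1.0: the buffer converges to the geometric series sum 1/(1 - 0.9) = 10 instead of blowing up.

```
buf = 0.0
for _ in range(100):
    buf = 0.9 * buf + 1.0  # momentum=0.9, constant gradient, no dampening
print(buf)  # ~9.9997, converging to 1/(1-0.9) = 10: bounded, not exploding
```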