How is the momentum initialized?

Hey there,
I’m a little bit confused about how the momentum for SGD is usually initialized…

I understand the concept: you always take the momentum from the last mini batch and interpolate it linearly with the new gradient.

But what about the first mini batch? What do we use as the momentum there? Just the gradient?

In the Excel spreadsheet it’s just a number that I can’t trace back to its origin…

Thank you in advance…

I think it is initialized to 0, so early in training the momentum is dragged toward zero by that initial value. There is a technique called bias correction (used in Adam) that can alleviate this. Andrew Ng has a nice explanation here.
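
To make the effect of the zero start concrete, here is a minimal sketch in plain Python. It assumes the interpolated form of momentum described in the question, v = beta * v + (1 - beta) * grad, with the buffer starting at 0. The function name `ema_momentum` and the value `beta=0.9` are just illustrative choices, and this is not meant to match any particular library's implementation.

```python
def ema_momentum(grads, beta=0.9, bias_correction=False):
    """Return the running momentum after each mini-batch gradient.

    Assumes the interpolated (exponential moving average) form:
        v_t = beta * v_{t-1} + (1 - beta) * g_t,  with v_0 = 0
    """
    v = 0.0  # momentum buffer starts at zero (the assumption under discussion)
    history = []
    for t, g in enumerate(grads, start=1):
        v = beta * v + (1 - beta) * g
        if bias_correction:
            # Divide by (1 - beta^t) to undo the pull toward the zero start,
            # as done for the moment estimates in Adam.
            history.append(v / (1 - beta ** t))
        else:
            history.append(v)
    return history


# Toy input: the same gradient of 1.0 on three consecutive mini batches.
grads = [1.0, 1.0, 1.0]
print(ema_momentum(grads))                        # starts far below 1.0
print(ema_momentum(grads, bias_correction=True))  # stays close to 1.0
```

With a constant gradient of 1.0, the uncorrected momentum after the first mini batch is only about 0.1 and climbs slowly from there, while the bias-corrected value is about 1.0 from the first step. As t grows, beta ** t goes to 0, so the correction fades out on its own.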