As Jeremy said in lecture 6:
“This is not momentum like in optimization, but this is momentum as in exponentially weighted moving average. Specifically, this mean and standard deviation (in the batch norm algorithm), we don’t actually use a different mean and standard deviation for every mini-batch. If we did, it would vary so much that it would be very hard to train. So instead, we take an exponentially weighted moving average of the mean and standard deviation.”
But the PyTorch documentation says:
We use the weighted average of the mean and variance during evaluation only.
Am I missing something? Please suggest.
…and it says the running estimates are kept during training; to my knowledge, it uses them during training and resets them with every batch.
This also explains the problems BN has with small batch sizes (bs), as the BN parameters can fluctuate too much with a small bs (see the Group Normalization paper).
In eval mode, I guess, you use the BN parameters derived from the entire training set.
It would be interesting to plot the BN parameters over training to get some intuition on how this works.
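As a rough starting point for that kind of plot, here is a small sketch in plain Python. It assumes standard-normal activations for a single feature, and the function name `track_batch_stats` is made up for illustration. It records the per-batch mean next to its exponentially weighted running mean, so the two can be compared for a small and a large batch size.

```python
import random
import statistics

def track_batch_stats(batch_size, steps=200, momentum=0.1, seed=0):
    """Record the batch mean and its running (EMA) mean at every step.

    Hypothetical helper: the "activations" are just N(0, 1) samples.
    """
    rng = random.Random(seed)
    running_mean = 0.0
    batch_means, running_means = [], []
    for _ in range(steps):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        mean = sum(batch) / batch_size
        # Exponentially weighted moving average of the batch mean.
        running_mean = (1 - momentum) * running_mean + momentum * mean
        batch_means.append(mean)
        running_means.append(running_mean)
    return batch_means, running_means

small_batch, small_running = track_batch_stats(batch_size=2)
large_batch, large_running = track_batch_stats(batch_size=64)

# Spread of the raw batch means: small bs vs large bs.
print(statistics.pstdev(small_batch), statistics.pstdev(large_batch))
# Spread of the raw batch means vs the smoothed running mean (small bs).
print(statistics.pstdev(small_batch), statistics.pstdev(small_running))
```

The returned lists can be handed to any plotting library. In this toy setup the batch means for bs=2 should scatter far more than those for bs=64, and the running mean should be visibly smoother than the raw batch means.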
Yes, it says it keeps running estimates during training for use in evaluation.
I implemented a BN layer as part of CS231n, and there I didn’t use the running mean during training, which is why I am confused.
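To make the distinction concrete, here is a minimal sketch of a CS231n-style layer in plain Python, assuming a single scalar feature and no learnable scale or shift. The class name `ToyBatchNorm` and its attributes are made up for illustration. In training mode it normalizes with the current batch statistics and only updates the running estimates on the side; in eval mode it normalizes with the running estimates. The update rule `(1 - momentum) * running + momentum * batch_stat` follows the convention described in the PyTorch docs; using the biased batch variance for the running estimate is a simplification.

```python
import math

class ToyBatchNorm:
    """Toy batch norm for one scalar feature (hypothetical, for illustration)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def forward(self, xs):
        if self.training:
            n = len(xs)
            mean = sum(xs) / n
            var = sum((x - mean) ** 2 for x in xs) / n  # biased batch variance
            # The running estimates are updated here but NOT used to normalize.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Eval mode: normalize with the accumulated running estimates.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in xs]

bn = ToyBatchNorm()
train_out = bn.forward([1.0, 2.0, 3.0, 4.0])  # normalized with batch stats
bn.training = False
eval_out = bn.forward([1.0, 2.0, 3.0, 4.0])   # normalized with running stats
print(train_out, eval_out, bn.running_mean, bn.running_var)
```

After a single training step the running estimates have barely moved from their initial values, so the same input gives quite different outputs in the two modes.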
Jeremy also mentioned that this momentum parameter can be used as a hyperparameter for regularisation.
But I think we need to plot the BN parameters to get a better intuition.
We’ll be looking at this in great detail in the next couple of lessons.
That would be great. Thanks