“This is not momentum like in optimization, but momentum as in an exponentially weighted moving average. Specifically, for the mean and standard deviation (in the batch norm algorithm), we don’t actually use a different mean and standard deviation for every mini-batch. If we did, it would vary so much that it would be very hard to train. So instead, we take an exponentially weighted moving average of the mean and standard deviation.”
…and it says you keep running estimates during training. To my understanding, though, they are not used during training, and they are *updated* (not reset) with every batch: in training mode the layer normalizes with the current batch’s statistics, while the running estimates are accumulated on the side for use later.
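As a minimal sketch of that update rule (using the PyTorch convention, where `momentum` is the weight on the *new* batch statistic; the helper name and toy values are my own):

```python
def update_running(running, batch_stat, momentum=0.1):
    # EWMA: keep 90% of the old estimate, blend in 10% of the new batch stat
    return (1 - momentum) * running + momentum * batch_stat

running_mean = 0.0  # typical initialization for the running mean
for batch_mean in [5.0, 5.2, 4.8, 5.1]:  # toy per-batch means
    running_mean = update_running(running_mean, batch_mean)

# After each batch the estimate is nudged toward the batch mean,
# not reset to it; over many batches it drifts toward ~5 here.
print(running_mean)
```

The same update is applied per channel to the running variance.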
This also explains BN’s problems with small batch sizes: the per-batch statistics (and hence the running estimates) fluctuate too much when batches are small (see the Group Normalization paper).
In eval mode, I guess, you use the running mean and variance accumulated over the whole of training, rather than the statistics of the current batch.
It would be interesting to plot the running estimates over the course of training to get some intuition on how this works.
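A quick simulation of that plot idea, under toy assumptions (Gaussian activations with true mean 5, the EWMA update above, and batch sizes I picked for illustration) — it also shows the small-batch fluctuation mentioned earlier:

```python
import random

random.seed(0)

def simulate(batch_size, steps=200, momentum=0.1, true_mean=5.0):
    """Track the running mean over `steps` training batches."""
    running, trace = 0.0, []
    for _ in range(steps):
        batch = [random.gauss(true_mean, 1.0) for _ in range(batch_size)]
        batch_mean = sum(batch) / batch_size
        # EWMA update of the running estimate
        running = (1 - momentum) * running + momentum * batch_mean
        trace.append(running)  # record for plotting
    return trace

small = simulate(batch_size=2)   # noisy per-batch means -> noisy estimate
large = simulate(batch_size=64)  # stable per-batch means -> stable estimate

# Both traces end up near the true mean of 5.0, but the small-batch
# trace wanders much more along the way.
print(small[-1], large[-1])
```

Plotting `small` and `large` against the step index (e.g. with matplotlib) would make the fluctuation difference visible directly.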