It seems that batch norm calculates a running mean and sd (over all examples it ever sees?), normalizes the inputs, and then denormalizes them using a trainable mean and sd. For some strange reason this makes the activations ‘nicer’ while still giving our network the ability to make them whatever it pleases (though I am not sure how this does not contradict making them ‘nicer’, or why we do not end up exactly where we started…).
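To make the mechanics concrete, here is a rough pure-Python sketch of batch norm on a single feature. The class name `TinyBatchNorm` and its attributes are made up for illustration, and it simplifies things (for example, it uses the biased variance for the running estimate). As far as I understand it, in training mode the layer normalizes with the statistics of the current batch and only keeps the running mean / variance as an exponential moving average for use at inference time. The last few lines show that if the trainable scale and shift happen to equal the batch sd and mean, the layer really does hand back its input, so "ending where we started" is possible but is only one of the options the network can learn.

```python
import math


class TinyBatchNorm:
    """Toy batch norm over one feature. A sketch, not a library implementation."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma = 1.0          # trainable scale (what PyTorch calls `weight`)
        self.beta = 0.0           # trainable shift (what PyTorch calls `bias`)
        self.running_mean = 0.0   # not trained by gradient descent
        self.running_var = 1.0    # not trained by gradient descent
        self.momentum = momentum
        self.eps = eps
        self.training = True

    def __call__(self, xs):
        if self.training:
            # Training: normalize with the statistics of the current batch...
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            # ...and update the running statistics as a moving average.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: use the stored running statistics instead.
            mean, var = self.running_mean, self.running_var
        scale = math.sqrt(var + self.eps)
        return [self.gamma * (x - mean) / scale + self.beta for x in xs]


xs = [1.0, 2.0, 3.0, 4.0]
bn = TinyBatchNorm()
out = bn(xs)  # zero mean, roughly unit variance

# If gamma and beta match the batch sd and mean, the layer is the identity.
batch_mean = sum(xs) / len(xs)
batch_var = sum((x - batch_mean) ** 2 for x in xs) / len(xs)
bn_identity = TinyBatchNorm()
bn_identity.gamma = math.sqrt(batch_var + bn_identity.eps)
bn_identity.beta = batch_mean
same = bn_identity(xs)
```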
The requires_grad = False will, I guess, prevent the trainable parameters from being learned (it seems weight == sd and bias == mean), but that will not do anything for the calculation of the running mean / sd…
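A quick way to check this in plain PyTorch (a minimal sketch, assuming a recent `torch` install; the layer size and the input are arbitrary): set `requires_grad = False` on a batch norm layer, run a batch through it in train mode and in eval mode, and see whether `running_mean` moves.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(3)
for p in bn.parameters():
    p.requires_grad = False  # freezes only the trainable weight and bias

# Arbitrary input whose mean is far from the initial running_mean of 0.
x = torch.randn(8, 3) * 5 + 10

# Train mode: the running statistics are buffers, not parameters,
# so they are still updated on every forward pass.
bn.train()
before = bn.running_mean.clone()
bn(x)
changed_in_train = not torch.equal(before, bn.running_mean)

# Eval mode: the stored statistics are used and left untouched.
bn.eval()
before = bn.running_mean.clone()
bn(x)
changed_in_eval = not torch.equal(before, bn.running_mean)

print(changed_in_train, changed_in_eval)
```

So with stock PyTorch, stopping the statistics from updating seems to need the layer in eval mode, not just `requires_grad = False`.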
Under what circumstances would I ever want to fully freeze bn? The answer should probably be always if I do not finetune the layers directly above?
I can maybe imagine a contrived example where we have images whose colors are not like ImageNet's, and we have only a few of them; then maybe there would be value in freezing the conv layers and recalculating the mean / sd…
I am worried I am completely missing something here on how the batch norm freezing should be used?
Oddly enough, every library except fastai always updates the statistics in bn layers even when they’re “frozen”. But I’ve never found this to be a good idea!
As for when to bn_freeze (i.e. freeze the bn stats even if the layer is trainable) - I’m not at all sure yet. Some architectures like Inception seem to absolutely require this. I’d be interested to hear what you learn if you do some experiments on different datasets and architectures.
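For anyone who wants to experiment with this outside fastai, here is one way the idea could be sketched in plain PyTorch. The helper name `freeze_bn_stats` is hypothetical and this is not how fastai implements bn_freeze; it simply puts every batch norm layer in eval mode so the statistics stop updating while the other layers stay trainable. The tiny model is only there to show the effect.

```python
import torch
import torch.nn as nn


def freeze_bn_stats(model):
    """Hypothetical helper: put all batch norm layers in eval mode.

    Calling model.train() afterwards switches them back to train mode,
    so this has to be re-applied after every model.train() call.
    """
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.eval()
    return model


torch.manual_seed(0)
model = nn.Sequential(nn.Conv2d(3, 4, 3), nn.BatchNorm2d(4), nn.ReLU())

model.train()
freeze_bn_stats(model)

bn = model[1]
before = bn.running_mean.clone()
model(torch.randn(2, 3, 8, 8) + 3.0)  # forward pass while "training"

stats_unchanged = torch.equal(before, bn.running_mean)
conv_trainable = model[0].weight.requires_grad
print(stats_unchanged, conv_trainable)
```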
AFAIK no-one else is currently looking at this issue at all!
After struggling to understand batch norm for a long time, I came up with the explanation below, which I am putting down so @jeremy can correct me if I am wrong.
Every layer in a NN learns a representation of the dataset it is trained on as a probabilistic distribution of weights. Batch norm is a process that keeps these distributions numerically stable, preventing the weights from exploding, by normalizing them. The usual normalization was not possible before batch norm because the mean and std. dev of a layer’s distribution are not known before training. Batch norm brings these variables into the network, so they are also learnable, thus providing numerical stability.
@radek I am making a wild guess, but when we are tuning the NN by changing its weights, would it be safe to preserve the distribution when tuning with a small number of images? If the new dataset is small, it should not be allowed to tamper with the distribution too much. But if the new dataset is large and totally different from the images the network was initially trained on, does it make sense to unfreeze them at later layers?