After struggling to understand batch norm for a long time, I have come up with the explanation below, which I am putting down so @jeremy can correct me if I am wrong.
Every layer in a NN is learning a representation of the dataset it is trained on, as a probabilistic distribution of weights. Batch norm is a process that brings numerical stability to these distributions, keeping the weights from exploding, by normalizing them. But this usual kind of normalization was not possible before batch norm, because the mean and std. dev of a layer's distribution are not known before training. Batch norm brings these variables into the network so that they, too, are learnable, thus providing numerical stability.
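To make the mechanics concrete for myself, here is a minimal toy sketch of what a batch norm layer computes in training mode (my own illustration, not the actual PyTorch implementation): the layer's outputs are normalized with the current batch's mean and std, and the learnable parameters gamma and beta then rescale and shift the result, so the "right" amount of normalization is itself learned during training.

```python
import torch
import torch.nn as nn

class SimpleBatchNorm(nn.Module):
    """Toy batch norm: normalize with batch statistics, then apply a
    learnable scale (gamma) and shift (beta)."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_features))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(num_features))   # learnable shift
        self.eps = eps

    def forward(self, x):
        # x: (batch, num_features) -- statistics are computed per feature, per batch
        mean = x.mean(dim=0)
        var = x.var(dim=0, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)  # normalized outputs
        return self.gamma * x_hat + self.beta            # network can adjust/undo the normalization

bn = SimpleBatchNorm(10)
y = bn(torch.randn(32, 10))  # each feature of y starts out roughly zero mean, unit variance
```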
@radek I am making a wild guess, but when we are tuning the NN by changing its weights, would it be safe to preserve the distribution when tuning with a small number of images? If the new dataset is small, it should not be allowed to tamper with the distribution too much. But if the new dataset is large and totally different from the images the network was initially trained on, would it make sense to unfreeze them in the later layers?
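For what it's worth, if one wanted to keep the batch norm distributions fixed while fine-tuning on a small dataset, something like the sketch below would do it in plain PyTorch (the model here is a hypothetical stand-in for a pretrained network, just to illustrate freezing the running statistics and the gamma/beta parameters):

```python
import torch.nn as nn

# Stand-in for a pretrained network (hypothetical; imagine a pretrained ResNet here)
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
)

def freeze_batchnorm(model: nn.Module) -> None:
    """Keep batch norm statistics and affine parameters fixed, so a small
    fine-tuning set cannot shift the distributions learned on the original data."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()                      # use stored running mean/var instead of batch statistics
            for p in m.parameters():
                p.requires_grad = False   # do not update gamma/beta

model.train()
freeze_batchnorm(model)  # note: model.train() flips BN back to training mode, so re-apply after it
```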