Does BatchNorm Cause Overfitting?

I’ve been digging into BatchNorm for the last few days, and I’m starting to doubt whether it’s as useful as the academic papers claim.

BatchNorm works well on academic datasets because their training and test splits come from the same distribution. That means the per-batch mean/variance seen during training is similar to the moving mean/variance accumulated over the training data, which in turn is similar to the mean/variance of a test batch.
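To show what I mean by those three sets of statistics, here is a rough sketch in plain Python (no framework). Everything in it is an assumption for illustration: the activations are scalars drawn from a Gaussian N(2, 3^2), the moving statistics are an exponential moving average with a made-up momentum of 0.1, and the helper names are my own.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def sample_batch(mean, std, size=1024):
    """Draw one batch of scalar activations from a Gaussian."""
    return [rng.gauss(mean, std) for _ in range(size)]

def batch_stats(batch):
    """Per-batch mean and (biased) variance."""
    m = sum(batch) / len(batch)
    v = sum((x - m) ** 2 for x in batch) / len(batch)
    return m, v

momentum = 0.1  # assumed update rate for the moving statistics
moving_mean, moving_var = 0.0, 1.0

# "Training": every batch comes from the same distribution, N(2, 3^2).
for _ in range(200):
    m, v = batch_stats(sample_batch(2.0, 3.0))
    moving_mean = (1 - momentum) * moving_mean + momentum * m
    moving_var = (1 - momentum) * moving_var + momentum * v

# "Test": a fresh batch drawn from the same distribution.
test_mean, test_var = batch_stats(sample_batch(2.0, 3.0))

print(f"moving: mean={moving_mean:.2f} var={moving_var:.2f}")
print(f"test:   mean={test_mean:.2f} var={test_var:.2f}")
```

When the test batch comes from the same distribution, the moving statistics and the test batch statistics land close to each other, which is the situation academic benchmarks are in.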

But in actual deployment scenarios, your training and test data are not from identical distributions (similar, but not identical). If the network was trained to expect a specific mean/variance, does that mean the model is ‘overfitted’ to the training set?
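Here is a minimal sketch of the mismatch I’m worried about, again in plain Python with made-up numbers. I assume the stored training statistics are mean 2 and variance 9, and that the deployment data is a hypothetical shifted Gaussian N(3.5, 4.5^2). The `normalize` helper is my own simplification of what BatchNorm does at inference time (fixed statistics, no learned scale or shift).

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def normalize(batch, mean, var, eps=1e-5):
    """Normalize with fixed, stored statistics (no learned scale/shift)."""
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]

def mean_var(xs):
    """Mean and (biased) variance of a list of floats."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

# Statistics assumed to have been accumulated on training data ~ N(2, 3^2).
train_mean, train_var = 2.0, 9.0

in_dist = [rng.gauss(2.0, 3.0) for _ in range(4096)]  # same distribution
shifted = [rng.gauss(3.5, 4.5) for _ in range(4096)]  # hypothetical deployment data

in_mean, in_var = mean_var(normalize(in_dist, train_mean, train_var))
sh_mean, sh_var = mean_var(normalize(shifted, train_mean, train_var))

print(f"in-distribution after norm: mean={in_mean:.2f} var={in_var:.2f}")
print(f"shifted after norm:         mean={sh_mean:.2f} var={sh_var:.2f}")
```

The in-distribution batch comes out roughly zero-mean and unit-variance, while the shifted batch does not, so the layers after BatchNorm see inputs they never saw during training. This sketch only shows the statistics mismatch; it says nothing about how much accuracy is actually lost.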

Has anyone investigated whether BatchNorm improves or worsens the generalizability of models, especially when the test data is not drawn from the same distribution as the training data?