How much does better initialization help with BatchNorm?

We discussed the importance of weight initialization in the first lesson. However, to what extent does it still matter in the presence of BatchNorm layers?

Note that this thread is not intended to undervalue good initialization, even though the title might make it sound so. Rather, it is about building an intuition for how much a better initialization could help when you have a network with BatchNorm and a poor initialization, for example a very deep CNN with ReLU using Xavier initialization.

Because TensorFlow’s default variable initializer is Glorot Uniform (another name for Xavier Uniform), anyone working with TensorFlow sees tons of existing projects trained with Xavier initialization, albeit with BatchNorm layers. So it would really help to have an intuition about how much a better initialization would change things.
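To get a feel for it, here is a toy sketch I put together (not from any real project — the depth, width and batch size are arbitrary) that pushes a random batch through a deep ReLU stack and prints the final activation std, once with Xavier and once with He (Kaiming) init, with and without BatchNorm:

```python
import torch
import torch.nn as nn

def make_block(d, init_fn, use_bn):
    lin = nn.Linear(d, d, bias=False)
    init_fn(lin.weight)                       # e.g. xavier_uniform_ or kaiming_uniform_
    layers = [lin]
    if use_bn:
        layers.append(nn.BatchNorm1d(d))      # fresh BN layer, running in train mode
    layers.append(nn.ReLU())
    return nn.Sequential(*layers)

@torch.no_grad()
def final_activation_std(init_fn, use_bn, depth=30, d=256):
    torch.manual_seed(0)
    x = torch.randn(512, d)                   # a batch of fake inputs
    for _ in range(depth):
        x = make_block(d, init_fn, use_bn)(x)
    return x.std().item()

inits = {
    "xavier":  nn.init.xavier_uniform_,
    "kaiming": lambda w: nn.init.kaiming_uniform_(w, nonlinearity="relu"),
}
for name, init_fn in inits.items():
    for use_bn in (False, True):
        std = final_activation_std(init_fn, use_bn)
        print(f"{name:8s} bn={use_bn}: activation std after 30 layers = {std:.2e}")
```

With Xavier and no BN the activations collapse towards zero as depth grows, with Kaiming they stay roughly unit scale, and with BN the final std looks about the same regardless of the init — which is exactly the intuition I am trying to quantify.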

I don’t have a good answer to that. I would say that it matters less, but that would not be a complete answer. I wonder if anyone has come across any relevant papers on this?

In a sense, it is a bit like comparing apples and oranges. BatchNorm is a somewhat separate beast that comes with its own issues. It has a compute cost and makes the memory footprint of your model larger (I haven’t looked at the exact numbers, but as far as I remember both of these costs are non-trivial, and there were some efforts to improve the memory efficiency by trading it for compute, IIRC).

Also, as far as I understand, it acts as a regularizing factor. That means we are effectively throwing away some network complexity that might be useful to have around.

It also makes training networks more complex, especially when small batches come into play, or when one batch can be significantly different from another. It adds complexity for things like SWA (stochastic weight averaging), and it is one more thing to keep in mind while writing code: when to put the model into ‘training mode’, where it updates the running stats, and when not. There is also the momentum aspect of those stats, which becomes another consideration once you move to training with smaller batches (and one that frameworks implement differently, which leads to major confusion and code digging! At the very least, Keras and PyTorch have different takes on how the momentum is applied).
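To make the momentum confusion concrete, this is roughly how the two frameworks apply their defaults, as far as I understand them (worth double-checking against the current docs):

```python
# PyTorch (nn.BatchNorm2d, default momentum=0.1): momentum weights the NEW batch statistic
#   running_mean = (1 - momentum) * running_mean + momentum * batch_mean
# Keras (layers.BatchNormalization, default momentum=0.99): momentum weights the OLD running statistic
#   moving_mean = momentum * moving_mean + (1 - momentum) * batch_mean
# So the two defaults are parameterized from opposite ends (and are not even the same
# effective averaging speed). On top of that, PyTorch's model.train()/model.eval()
# switches between using batch stats (and updating the running stats) and using the
# stored running stats.

def pytorch_style_update(running_mean, batch_mean, momentum=0.1):
    return (1 - momentum) * running_mean + momentum * batch_mean

def keras_style_update(moving_mean, batch_mean, momentum=0.99):
    return momentum * moving_mean + (1 - momentum) * batch_mean
```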

In summary, BatchNorm is this weird, powerful beast that seems to work very well, but it comes with its own set of headaches, and it leaves open the question of whether a different approach could bring better results (especially for practitioners with limited compute available).

I feel that any deliberation on the extent to which initialization matters less with BatchNorm should also include considerations of its costs.

Having said that, from a practical perspective I also wonder what results people have been able to get on something like ImageNet or CIFAR entirely without BatchNorm :thinking: For instance, here I trained on CIFAR-10 to 94% accuracy in around 13 min 15 s on a single 1080 Ti.

It would be really, really fun to see if one could figure out how to initialize the network without using BatchNorm and see what would happen.
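If someone wants to try, one starting point could be something LSUV-like (‘All you need is a good init’, Mishkin & Matas): run a batch of real data through the network and rescale each layer so its outputs come out with roughly unit variance. A very rough sketch of the idea (my own simplification, not the paper’s exact algorithm — the helper name and tolerances are made up):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def lsuv_like_init(model, x, tol=0.05, max_iters=10):
    """Rescale each Linear/Conv2d layer so its output std is ~1 on the batch x.

    Rough take on the LSUV idea: no orthonormal pre-init, just iterative rescaling.
    """
    for module in model.modules():
        if not isinstance(module, (nn.Linear, nn.Conv2d)):
            continue
        captured = {}
        handle = module.register_forward_hook(
            lambda m, inp, out: captured.__setitem__("out", out))
        for _ in range(max_iters):
            model(x)                              # forward pass just to measure this layer
            std = captured["out"].std().item()
            if abs(std - 1.0) < tol:
                break
            module.weight.data /= std             # rescale toward unit output std
            if module.bias is not None:
                module.bias.data /= std
        handle.remove()
    return model

# usage on a throwaway net with random "data"
net = nn.Sequential(
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
lsuv_like_init(net, torch.randn(256, 32))
```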

I remember from Twitter there was some person who trained on CIFAR-10 in a very elaborate way (I remember Jeremy retweeting his tweet but I couldn’t find it when I looked for it some time ago) and got some great results. Now that I think of it, I wonder whether they were using BatchNorm, and what people on the cutting edge generally do nowadays.

I do, however, also remember Jeremy going through a phase of using BatchNorm for input normalization, which I think is really cool :slight_smile: I don’t remember the last time I trained a model myself that didn’t leverage BN in one shape or form. So the picture is rather complex :slight_smile:
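The input-normalization trick is basically just sticking a BatchNorm layer in front of the network so you don’t have to hard-code dataset means and stds; something along these lines (a sketch of the idea, not the actual fastai code):

```python
import torch.nn as nn

def with_input_bn(body, n_channels=3):
    """Prepend a BatchNorm layer that standardizes the raw image inputs,
    so the data pipeline doesn't need hand-coded per-channel mean/std."""
    return nn.Sequential(nn.BatchNorm2d(n_channels), body)

# e.g. model = with_input_bn(torchvision.models.resnet18(num_classes=10))
```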


Yes, there’s a recent paper on arXiv, Weight Standardization (WS), that sheds light on this topic. Below is a screenshot from the paper comparing the performance of their network trained with weight standardization (plus group normalization) at batch size = 1 against a network trained with BN at a large batch size.
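The idea itself is simple to try: standardize each conv filter’s weights (zero mean, unit std over its fan-in) before every forward pass, and pair it with GroupNorm instead of BatchNorm. A sketch along the lines of the paper’s description (details like the eps value are my guess — check the official code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with Weight Standardization: weights are standardized
    per output channel before every forward pass."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)   # over in_channels x kH x kW
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        w = (w - mean) / std
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# typical pairing from the paper: WS + GroupNorm instead of BatchNorm
def ws_gn_block(in_ch, out_ch, groups=32):
    return nn.Sequential(
        WSConv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.GroupNorm(groups, out_ch),   # out_ch must be divisible by groups
        nn.ReLU(inplace=True),
    )
```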


Hi, could someone tell me how to implement an arbitrary-dimension BatchNorm layer, such as BatchNorm4d, BatchNorm6d, …?
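PyTorch only ships BatchNorm1d/2d/3d, but batch norm itself only cares about the batch and channel dimensions, so one way (a quick sketch, assuming input shaped (N, C, *spatial)) is to flatten the extra dimensions, reuse nn.BatchNorm1d, and reshape back:

```python
import torch
import torch.nn as nn

class BatchNormNd(nn.Module):
    """BatchNorm over inputs of shape (N, C, d1, d2, ..., dk) for any k,
    by flattening the spatial dims and reusing nn.BatchNorm1d."""
    def __init__(self, num_features, **kwargs):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features, **kwargs)

    def forward(self, x):
        n, c, *spatial = x.shape
        x = x.reshape(n, c, -1)          # (N, C, prod(spatial))
        x = self.bn(x)
        return x.reshape(n, c, *spatial)

# e.g. a "BatchNorm4d" over a (N, C, d1, d2, d3, d4) tensor:
bn4d = BatchNormNd(16)
y = bn4d(torch.randn(8, 16, 4, 4, 4, 4))
```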