I don’t have a good answer to that. I would say that it matters less, but that would not be a complete answer. I wonder if anyone has come across any relevant papers on this?
In a sense, it is a bit like comparing apples and oranges. BatchNorm is a somewhat separate beast that comes with its own issues. It has a compute cost and makes the memory footprint of your model larger (I haven’t looked at the exact numbers, but intuitively, and as far as my memory serves me, both of these costs are non-trivial; IIRC there were some efforts to improve the memory efficiency by trading it for compute).
Also, as far as I understand, it acts as a regularizer. That means we are effectively throwing away some network capacity that might be useful to have around.
It also makes training networks more complex, especially when small batches come into play, or when one batch can differ significantly from another. It adds complexity for things like SWA (stochastic weight averaging) and means there is another thing to keep in mind while writing code: when to put the model into ‘training mode’, where it recalculates the running stats, and when not to. There is also the momentum applied to those stats, which becomes another consideration once one moves to training with smaller batches (and one that frameworks implement differently, which leads to major confusion and code digging! At the very least, Keras and PyTorch take different approaches to calculating the momentum.)
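To make the momentum confusion concrete, here is a small pure-Python sketch (illustrative values, not the actual library code) of the two running-stat update conventions. PyTorch’s BatchNorm uses `momentum` as the weight on the *new* batch statistic (default 0.1), while Keras uses it as the weight on the *old* running statistic (default 0.99), so a PyTorch momentum of 0.1 roughly corresponds to a Keras momentum of 0.9:

```python
def pytorch_update(running_mean, batch_mean, momentum=0.1):
    # PyTorch convention: momentum weights the NEW batch statistic.
    # new_running = (1 - momentum) * old_running + momentum * batch_stat
    return (1 - momentum) * running_mean + momentum * batch_mean

def keras_update(moving_mean, batch_mean, momentum=0.99):
    # Keras convention: momentum weights the OLD moving statistic.
    # new_moving = momentum * old_moving + (1 - momentum) * batch_stat
    return momentum * moving_mean + (1 - momentum) * batch_mean

# Same data, default settings in each framework: the stats converge
# toward the batch mean at very different speeds.
running = 0.0
for batch_mean in [1.0, 1.0, 1.0]:
    running = pytorch_update(running, batch_mean)
print(round(running, 3))  # 0.271

keras_running = 0.0
for batch_mean in [1.0, 1.0, 1.0]:
    keras_running = keras_update(keras_running, batch_mean)
print(round(keras_running, 6))  # 0.029701
```

With small batches the batch statistics are noisy, so whichever convention you are on, the effective smoothing window matters a lot more than the parameter name suggests.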
In summary, BatchNorm is this weird, powerful beast that seems to work very well, but it comes with its own set of headaches, and it leaves open the question of whether a different approach could bring better results (especially for practitioners with limited compute available).
I feel that any deliberation on the extent to which initialization matters less with BatchNorm should also include considerations of its costs.
Having said that, I also wonder, from a practical perspective, what results people have been able to get on something like ImageNet or CIFAR purely without BatchNorm. For instance, here I trained on CIFAR10 to 94% accuracy in around 13 min 15 sec on a single 1080 Ti.
It would be really, really fun to see if one could figure out how to initialize a network without using BatchNorm and see what would happen.
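As one hypothetical starting point (my own sketch, not something anyone in this thread has tried): without BatchNorm, the classic move is a careful weight init such as He/Kaiming init, which scales weights by sqrt(2 / fan_in) so the activation scale stays roughly constant through ReLU layers instead of exploding or vanishing:

```python
import numpy as np

def he_init(fan_in, fan_out, rng):
    # He/Kaiming init: std = sqrt(2 / fan_in) keeps the second moment
    # of ReLU activations roughly constant from layer to layer.
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 256))  # a batch of 512 activation vectors
for _ in range(10):                  # push through 10 linear+ReLU layers
    w = he_init(x.shape[1], 256, rng)
    x = np.maximum(x @ w, 0.0)       # linear layer followed by ReLU
print(float(x.std()))  # stays in a healthy range, no blow-up or collapse
```

Of course this alone is a far cry from everything BatchNorm does (no per-batch recentering, no regularizing noise), which is exactly why the experiment would be interesting.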
I remember from Twitter that someone trained on CIFAR10 in a very elaborate way (I remember Jeremy retweeting their tweet, but I couldn’t find it when I looked for it some time ago) and got some great results. Now that I think of it, I wonder whether they were using BatchNorm or not, and what people on the cutting edge generally do nowadays.
I do, however, also remember Jeremy having a phase of using BatchNorm for input normalization, which I think is really cool. I don’t remember the last time I trained a model myself that didn’t leverage BN in one shape or form or another. So the picture is rather complex.