Questions about batch normalization

The notes say I should always use Batch Normalization in modern networks. Does this really mean I should have a Batch Normalization layer whenever I can have one? Are there really no situations in which Batch Normalization could hurt performance rather than improve it?

It seems to me that Batch Normalization goes between the output of a linear layer and a non-linearity (ReLU, etc.). But when it comes to dropout, should Batch Normalization go before or after it? Or does it make little difference, with both orderings working equally well?

3 Likes

You want the batchnorm after the non-linearity, and before the dropout.
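
In Keras terms, that ordering looks something like this (a minimal sketch; the layer sizes are arbitrary):

from keras.layers import Dense, Dropout
from keras.layers.normalization import BatchNormalization

# linear layer + non-linearity first, then batchnorm, then dropout
block = [
    Dense(4096, activation='relu'),
    BatchNormalization(),
    Dropout(0.5),
    ]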

Using batchnorm in RNNs requires care. This is an active area of recent research. I’m not aware of situations where batchnorm hurts CNNs.

7 Likes

@jeremy Thank you, all of this is incredibly helpful. I wanted to clarify something. Above you suggest applying BatchNorm before the dropout. I was going through the notebooks to try to find a pattern and understand where to use it:

In Lesson3.ipynb, when we first used Batchnorm, I can see that we applied it after the Dense layer and relu activation, but also after the dropout.

# You should put BatchNorm after Conv layers, and Dense layers as well
def get_bn_layers(p):
    return [
        MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
        Flatten(),
        Dense(4096, activation='relu'),
        Dropout(p),
        BatchNormalization(),
        Dense(4096, activation='relu'),
        Dropout(p),
        BatchNormalization(),
        Dense(1000, activation='softmax')
        ]

On the other hand, in mnist.ipynb, we apply it after the Dense layer with the nonlinear activation (relu) but before the dropout as you recommend in this thread.

def get_model_bn_do():
    model = Sequential([
        Lambda(norm_input, input_shape=(1,28,28)),
        Convolution2D(32,3,3, activation='relu'),
        BatchNormalization(axis=1),
        Convolution2D(32,3,3, activation='relu'),
        MaxPooling2D(),
        BatchNormalization(axis=1),
        Convolution2D(64,3,3, activation='relu'),
        BatchNormalization(axis=1),
        Convolution2D(64,3,3, activation='relu'),
        MaxPooling2D(),
        Flatten(),
        BatchNormalization(),
        Dense(512, activation='relu'),
        BatchNormalization(),
        Dropout(0.5),
        Dense(10, activation='softmax')
        ])
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model

What is the difference between the two? Is the Lesson3 usage (after dropout) an exception due to the pretrained VGG activations? Is it safe to always put it before dropout as a rule of thumb?

Note: I am not sure how reliable it is, but at this link a couple of people recommend putting BatchNorm between the linear output and the non-linear activation, if I understand correctly. Any thoughts?

1 Like

@berkmeister Good catch! That is a mistake in the Lesson 3 Notebook. BatchNorm should be applied before Dropout.

1 Like

Well spotted. When I implemented this I did some searching and more recent advice seems to be to put it after the non-linearity, based on some experiments.

6 Likes

Is BatchNormalization a memory-intensive process? The model does seem to train faster, but the clock time it takes to process an epoch seems to be longer. Just wondering why this might be so.

You can look at the model summary to see how many params are in each bn layer. In most models it won’t be a large proportion, so it shouldn’t impact memory much. However, it may impact computation time - it depends a lot on how it’s implemented in practice. It would be interesting to see the results if you wanted to try benchmarking models with and without bn in each of Theano and TensorFlow, to see how long each epoch takes.
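
A rough sketch of how you might time it (the model and data names here are placeholders for whatever you want to compare):

import time

def time_epochs(model, X, y, n_epochs=3, batch_size=64):
    # train one epoch at a time and record the wall-clock time of each
    times = []
    for _ in range(n_epochs):
        start = time.time()
        model.fit(X, y, batch_size=batch_size, nb_epoch=1, verbose=0)  # nb_epoch in Keras 1; epochs=1 in Keras 2
        times.append(time.time() - start)
    return times

# e.g. compare time_epochs(model_with_bn, X, y) against time_epochs(model_without_bn, X, y)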

@rachel Judging by the examples we shouldn’t be putting batchnorm between conv & maxpool layers either?

And I assume you could put it before Flatten instead of after, but you would need to include axis=1?
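
i.e. something like this (just a sketch, assuming Theano-style channels-first conv output):

from keras.layers import Flatten
from keras.layers.normalization import BatchNormalization

# before Flatten the features are (channels, rows, cols), so normalize over the channel axis
before_flatten = [BatchNormalization(axis=1), Flatten()]

# after Flatten each sample is a flat vector, so the default axis is fine
after_flatten = [Flatten(), BatchNormalization()]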

There is interesting discussion about BatchNorm and ReLU in the /r/MachineLearning thread https://www.reddit.com/r/MachineLearning/comments/67gonq/d_batch_normalization_before_or_after_relu/

User ReginaldIII comments:

From a statistics point of view BN before activation does not make sense to me. BN is normalizing the distribution of features coming out of a convolution; some of these features might be negative, which will be truncated by a non-linearity like ReLU. If you normalize before activation you are including these negative values in the normalization immediately before culling them from the feature space. BN after activation will normalize the positive features without statistically biasing them with features that do not make it through to the next convolutional layer.
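
A quick NumPy sketch of the point, using random data just to look at the statistics:

import numpy as np

np.random.seed(0)
x = np.random.randn(10000)            # pre-activation features, roughly half of them negative

def relu(v):
    return np.maximum(v, 0.0)

# normalize before the ReLU: the negative values are included in the stats, then truncated
before = relu((x - x.mean()) / x.std())

# normalize after the ReLU: only the surviving non-negative features are used
a = relu(x)
after = (a - a.mean()) / a.std()

print(before.mean(), before.std())    # no longer zero-mean / unit-std after the truncation
print(after.mean(), after.std())      # ~0 and ~1 by construction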

2 Likes

Segment on Batch Normalization from Karpathy’s CS231n course (2016): https://youtu.be/gYpoJMlgyXA?list=PLkt2uSq6rBVctENoVBg1TpCC7OQi31AlC&t=3079

Hi Jeremy, thank you very much for your amazing courses. I was just looking at Andrej Karpathy’s lecture on batchnorm, and as you can see in the slides in the link below, it says to put the batchnorm before the non-linearity. Any idea which one is correct or more effective in practice?

Hi,
I am working on the Statefarm dataset and I am baffled by how important BatchNormalization seems to be.

I am using the VGG model and precomputed the convolutional layers. Then I wanted to just build some simple models on top. At first I built a simple model without batch normalization, but as you can see it wasn’t able to learn at all (first model in the picture); 10% accuracy = random guessing. Then with batch normalization, the training is very fast (second model in the picture).

If I load the weights from the original VGG model and only train the last layer, I can get the model to train without batch norm, but the accuracy is still much lower (45%).
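
Roughly, the two kinds of model I mean look like this (a sketch only; the layer sizes are illustrative, not my exact models):

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers.normalization import BatchNormalization

# simple head on top of the precomputed (flattened) VGG conv features
def head_without_bn(n_features):
    return Sequential([
        Dense(256, activation='relu', input_shape=(n_features,)),
        Dropout(0.5),
        Dense(10, activation='softmax'),
        ])

def head_with_bn(n_features):
    return Sequential([
        Dense(256, activation='relu', input_shape=(n_features,)),
        BatchNormalization(),
        Dropout(0.5),
        Dense(10, activation='softmax'),
        ])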

So I guess I have two questions:

  1. Why does adding batch norm have such a huge impact here?
  2. How were models able to train without batch norm at all? Much more data, much longer training? Or am I missing something here?

Okay, so if I understood it right you said that it is common practice now to apply the batchnorm after the non-linearity, and not between the linear layer and the following activation.

Is there an easy explanation for that recommendation? In the original paper they explain that batchnorm should go between the linear layer and the non-linear activation, because that makes sure the normalized inputs to the non-linearity lie within the interesting non-linear region of the function, which I find very plausible.
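
For reference, the transform from the paper normalizes each feature over the batch and then re-scales and shifts it with learned parameters; a rough NumPy sketch:

import numpy as np

def batchnorm(x, gamma, beta, eps=1e-5):
    # x has shape (batch, features): normalize each feature over the batch,
    # then apply the learned scale (gamma) and shift (beta)
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta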

Why should it make more sense to apply the batchnorm after the non-linearity?

1 Like

Hi,
Does anyone know why full whitening of each layer’s input is not everywhere differentiable, while independent scalar feature whitening is supposed to be?
Thanks

May I know why we use batch normalization exclusively for continuous variables?