Where should I place the batch normalization layer(s)?

Where should I place the BatchNorm layer to train a high-performance model (like a CNN or RNN)? :flushed::flushed:

  1. Between each layer?:thinking:

  2. Just before or after the activation function layer?:thinking:

  3. Should it go before or after the activation function layer?:thinking:

  4. How about the convolution layer and pooling layer?:thinking:

And where shouldn’t I place the BatchNorm layer?


You should put it after the non-linearity (e.g. the ReLU layer). If you are using dropout, remember to place BatchNorm before it.
Ex:

Linear
ReLU
BatchNorm
Dropout
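
If it helps, here is a minimal PyTorch sketch of that ordering (the layer sizes and dropout probability are made up, purely for illustration):

import torch.nn as nn

# Sketch of the ordering suggested above: Linear -> ReLU -> BatchNorm -> Dropout.
block = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.BatchNorm1d(128),  # normalize the activations after the non-linearity
    nn.Dropout(p=0.5),    # dropout comes last, after BatchNorm
)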


Thank you for the reply.:grin:
But I got the exact opposite answer here.:rofl::rofl:


Yes, I was reading the paper today: it is mentioned on page 5 of the original paper.

Note the equation g(BN(Wu)), where g is the non-linearity; it is applied after BN.
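
For comparison, the paper’s ordering would look roughly like this in PyTorch (a minimal sketch with made-up sizes, not code from the paper):

import torch.nn as nn

# The paper's scheme z = g(BN(Wu)): BatchNorm is applied to the linear output
# and the non-linearity comes last. The bias is dropped since BN's shift
# parameter makes it redundant.
paper_block = nn.Sequential(
    nn.Linear(256, 128, bias=False),
    nn.BatchNorm1d(128),
    nn.ReLU(),
)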

However, Jeremy puts it after the non-linearity in the “batchnorm” notebook of part 2:

#export
def conv_layer(ni, nf, ks=3, stride=2, bn=True, **kwargs):
    layers = [nn.Conv2d(ni, nf, ks, padding=ks//2, stride=stride, bias=not bn),
              GeneralRelu(**kwargs)]
    if bn: layers.append(nn.BatchNorm2d(nf, eps=1e-5, momentum=0.1))
    return nn.Sequential(*layers)

So I came here to see if anyone was asking themselves this question and found your post.

Then here Jeremy says it seems to be better to put it after (so at least he has been consistent); however, he doesn’t point to the experiments he mentions (and this was in 2016).

And then the discussion on the PyTorch forums suggests otherwise. I suppose the difference in training will not be very large, so it wouldn’t matter too much in the end, but this clearly lacks clarity ^^

Have you done any experiments regarding this?

@sgugger I’ll allow myself an @ mention here, since this is somewhat related to the course: it offers an implementation that is not what the paper suggests… and it might be somewhat confusing for students (please let me know if you don’t consider this a proper use of @).

edit: from the original paper, in the conclusion:

[screenshot of the relevant passage from the paper’s conclusion]

Regarding the order of conv, ReLU, and BN: we follow the traditional ResNet architecture (conv/bn/relu) most of the time, but networks with conv/relu/bn also seem to work to some extent. In v2, we added a parameter called bn_1st to ConvLayer to make it easy to change that order and experiment (see the sketch below).
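
For illustration, something like the following toggles between the two orderings; this is just a sketch of the idea, not the actual fastai ConvLayer code:

import torch.nn as nn

# Sketch of a bn_1st-style switch between conv/bn/relu and conv/relu/bn.
def conv_block(ni, nf, ks=3, stride=2, bn_1st=True):
    conv = nn.Conv2d(ni, nf, ks, padding=ks//2, stride=stride, bias=False)
    bn, act = nn.BatchNorm2d(nf), nn.ReLU()
    layers = [conv, bn, act] if bn_1st else [conv, act, bn]
    return nn.Sequential(*layers)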


Can you please explain the goal of using batch normalization after ReLU?
ReLU introduces sparsity, and normalizing on top of it – isn’t this a loss of information?