Ordering of layers

Could I please ask why we go batchnorm -> dropout -> linear -> activation? In some places, I believe, the suggested ordering is dropout -> linear -> batchnorm -> activation.

Do we have a strong reason for doing it this way or is it a matter of preference? Or did the ordering that we use turn out to be better in practice than sticking the BN right before activation?


Where exactly are we using these two, for example? Thanks

When you do ConvLearner.pretrained, these are the layers you get above the conv block (on top of AdaptiveConcatPool2d):

Sequential (
  (0): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)
  (1): Dropout (p = 0.25)
  (2): Linear (1024 -> 512)
  (3): ReLU ()
  (4): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)
  (5): Dropout (p = 0.5)
  (6): Linear (512 -> 2)
  (7): LogSoftmax ()
)

BTW the AdaptiveConcatPool2d is a very neat layer that enables all this - definitely worth taking a look at :slight_smile:
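For anyone curious, here is a minimal sketch of what AdaptiveConcatPool2d does, assuming plain PyTorch (this is my reconstruction, not fastai's exact source): it concatenates adaptive average pooling and adaptive max pooling along the channel dimension, which is why a 512-channel backbone feeds 1024 features into the head above.

```python
import torch
import torch.nn as nn

class AdaptiveConcatPool2d(nn.Module):
    """Sketch: concat adaptive avg-pool and adaptive max-pool outputs."""
    def __init__(self, output_size=1):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool2d(output_size)
        self.max = nn.AdaptiveMaxPool2d(output_size)

    def forward(self, x):
        # Each pool yields (N, C, 1, 1); concatenating doubles channels to 2*C
        return torch.cat([self.avg(x), self.max(x)], dim=1)

# A 512-channel feature map becomes 1024 channels after pooling
x = torch.randn(2, 512, 7, 7)
pooled = AdaptiveConcatPool2d()(x)
print(pooled.shape)  # torch.Size([2, 1024, 1, 1])
```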

Thanks. With the little knowledge I have about batchnorm, doing it before dropout makes more sense to me from a statistical point of view, since we want the sample mean and std to be calculated over every member (every activation, in this case). But this is just a guess; we will probably learn a lot about it in upcoming courses :slight_smile:
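That intuition can be checked numerically: in training mode, dropout zeroes activations and rescales the survivors by 1/(1-p), so batch statistics computed after it come from a masked, inflated sample rather than from every activation. A quick hedged demonstration (my own toy example, not from the course):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1000, 8)  # pretend these are pre-BN activations

# Dropout in training mode masks half the entries and rescales the rest by 2x,
# so the per-feature std measured afterwards is noticeably inflated.
drop = nn.Dropout(p=0.5).train()
masked = drop(x)

print(x.std(dim=0).mean())       # ~1.0 for standard-normal input
print(masked.std(dim=0).mean())  # larger, roughly sqrt(1/(1-p)) times
```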


I read some comments regarding trying these two different orderings on ImageNet with ResNet, and the one we're using turned out to work better. @kcturgutlu's intuition is also how I've thought about this, but I don't know if that's the real reason or not.
