Ordering of layers

Could I please ask why we go batchnorm -> dropout -> linear -> activation? In some places I believe the suggested ordering is dropout -> linear -> batchnorm -> activation.

Do we have a strong reason for doing it this way, or is it a matter of preference? Or did the ordering we use simply turn out to work better in practice than sticking the BN right before the activation?
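If it helps, here is a minimal sketch of the two orderings I mean, written as plain PyTorch Sequential blocks (the 1024/512 sizes and dropout p are just illustrative, not meant to be the exact defaults):

```python
import torch.nn as nn

# Ordering used in the fastai head (as printed later in this thread)
fastai_style = nn.Sequential(
    nn.BatchNorm1d(1024),
    nn.Dropout(p=0.25),
    nn.Linear(1024, 512),
    nn.ReLU(),
)

# The ordering I have seen suggested elsewhere
alternative = nn.Sequential(
    nn.Dropout(p=0.25),
    nn.Linear(1024, 512),
    nn.BatchNorm1d(512),
    nn.ReLU(),
)
```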

Where exactly are we using these two, for example? Thanks

When you do ConvLearner.pretrained, these are the head layers you get on top of the conv block (sitting right after the AdaptiveConcatPool2d):

Sequential (
  (0): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)
  (1): Dropout (p = 0.25)
  (2): Linear (1024 -> 512)
  (3): ReLU ()
  (4): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)
  (5): Dropout (p = 0.5)
  (6): Linear (512 -> 2)
  (7): LogSoftmax ()
)

BTW the AdaptiveConcatPool2d is a very neat layer that enables all this - definitely worth taking a look at :)
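Roughly, it concatenates an adaptive max pool and an adaptive average pool over the conv feature map, which is where the 1024 input features above come from (2 × 512 channels for a resnet34-style backbone). A sketch of the idea, not the actual fastai source:

```python
import torch
import torch.nn as nn

class AdaptiveConcatPool2d(nn.Module):
    """Concatenate adaptive max pooling and adaptive average pooling.

    Pooling the backbone's feature map two ways doubles the channel count,
    e.g. 2 * 512 = 1024 features feeding the first BatchNorm1d/Linear above.
    """
    def __init__(self, output_size=1):
        super().__init__()
        self.max = nn.AdaptiveMaxPool2d(output_size)
        self.avg = nn.AdaptiveAvgPool2d(output_size)

    def forward(self, x):
        return torch.cat([self.max(x), self.avg(x)], dim=1)
```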

Thanks. With the little knowledge I have about batchnorm, doing it before dropout makes more sense to me from a statistical point of view, since we want the sample mean and std to be calculated from every member (every activation, in this case). But this is just a guess; we will probably learn a lot about it in upcoming courses :)
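One way to see that intuition numerically, assuming it is about which statistics BN would compute if dropout ran first (the tensor sizes here are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(512, 10)   # a batch of 512 examples, 10 activations each
drop = nn.Dropout(p=0.5)
drop.train()               # dropout is only active in training mode

# Per-feature statistics over the full batch (what BN sees if it comes first)
print(x.mean(0).abs().mean(), x.var(0).mean())    # ~0, ~1

# If dropout runs first, roughly half the values are zeroed and the rest are
# scaled by 1/(1-p); the mean is preserved in expectation, but the per-batch
# variance BN would normalize with is roughly doubled and much noisier
xd = drop(x)
print(xd.mean(0).abs().mean(), xd.var(0).mean())  # ~0, ~2
```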

I read some comments about trying these two different orderings on ImageNet with ResNet, and the one we’re using turned out to work better. @kcturgutlu’s intuition is also how I’ve thought about this, but I don’t know if that’s the real reason or not.
