Could I please ask why we go batchnorm -> dropout -> linear -> activation? In some places the suggested ordering, I believe, is dropout -> linear -> batchnorm -> activation.

Do we have a strong reason for doing it this way, or is it a matter of preference? Or did the ordering we use turn out to be better in practice than putting the BN right before the activation?

Thanks. With the little knowledge I have about batchnorm, doing it before dropout makes more sense to me from a statistical point of view, since we want the sample mean and std to be calculated from every member (activation, in this case). But this is just a guess; we will probably learn a lot about it in upcoming courses.
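To make that intuition concrete, here is a small numpy sketch (just an illustration, not how any framework actually implements it) comparing the two orderings. The `batchnorm` and `dropout` helpers below are simplified stand-ins: `batchnorm` normalizes each feature with the batch mean/std, and `dropout` is inverted dropout. The point is that if dropout runs first, BN's statistics are computed on a tensor where roughly half the entries have been zeroed, which inflates the per-feature std estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(1024, 8))  # a batch of activations

def batchnorm(a, eps=1e-5):
    # Normalize each feature using the batch mean and std (simplified BN,
    # no learned scale/shift, no running statistics).
    return (a - a.mean(axis=0)) / (a.std(axis=0) + eps)

def dropout(a, p=0.5):
    # Inverted dropout: zero each element with probability p,
    # scale the survivors by 1/(1-p).
    mask = rng.random(a.shape) >= p
    return a * mask / (1 - p)

# Ordering discussed here: BN first, so its stats see every activation.
bn_then_drop = dropout(batchnorm(x))

# Alternative: dropout first, so BN's mean/std are estimated from a
# tensor full of zeroed entries.
drop_then_bn = batchnorm(dropout(x))

# Compare the std BN would see in each case: dropout noticeably
# inflates the per-feature std relative to the clean activations.
print("std of x:            ", x.std(axis=0).mean())
print("std after dropout(x):", dropout(x).std(axis=0).mean())
```

With BN first, the normalization statistics come from every activation in the batch; with dropout first, they come from a noisier, masked version of it. Whether that actually explains the practical difference is another question, as discussed below.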

I read some comments about trying these two orderings on ImageNet with ResNet, and the one we’re using turned out to work better. @kcturgutlu’s intuition is also how I’ve thought about this, but I don’t know if that’s the real reason or not.