Yes, I was reading the paper today: here is where it is mentionned:
(page 5 of the original paper).
Note the equation g(BN(Wu)) where g is the non linearity, and it is applied after BN.
However, Jeremy puts it after in the notebook “batchnorm” of part2:
def conv_layer(ni, nf, ks=3, stride=2, bn=True, **kwargs):
layers = [nn.Conv2d(ni, nf, ks, padding=ks//2, stride=stride, bias=not bn),
if bn: layers.append(nn.BatchNorm2d(nf, eps=1e-5, momentum=0.1))
So I came here to see if anyone was asking themselves this question and found your post.
Then here Jeremy says it seems to be better to put it after (so at least he’s been consistent), however he doesn’t point towards the experiments he is mentioning (and this was in 2016).
And then the discussion on the pytorch forums suggests otherwise. I suppose the difference in training will not be very high, so it wouldn’t matter too much in the end, but this clearly lacks clarity ^^
Have you done any experiment regarding this ?
@sgugger I’ll allow myself a @ mentionning here, since this is (somewhat related to the course), because the course offers an implementation that is not what the paper suggests… and it might be somewhat confusing for students (please let me know if you don’t consider this a proper use of @ ).
edit: from the original paper, in the conclusion: