Slower convergence when adding BatchNorm1d layer

Hello everyone, great to meet you all!

I have been trying to write a custom DenseNet implementation that adds a few layers after the final dense block. As Andrej Karpathy suggests, I am testing the changes by overfitting the network on a small number of training examples (32 images).

Whilst doing this I came across some strange behaviour: if I include two nn.BatchNorm1d layers at the top of my network, it becomes much harder to overfit on those 32 images.

The code for the network:

import torch.nn as nn
from collections import OrderedDict


class BatchNormTest(nn.Module):
    def __init__(self, top_dense_features=512, top_drop_rate=0, num_classes=1000):
        super().__init__()

        # Top of the network: two Linear -> BatchNorm1d -> ReLU -> Dropout blocks,
        # followed by the classification layer.
        joined_num_features = 2352
        self.top = nn.Sequential(OrderedDict([
            ('top_dense0', nn.Linear(joined_num_features, top_dense_features)),
            ('top_norm0', nn.BatchNorm1d(top_dense_features)),
            ('top_relu0', nn.ReLU(inplace=True)),
            ('top_dropout0', nn.Dropout(p=top_drop_rate, inplace=True)),
            ('top_dense1', nn.Linear(top_dense_features, top_dense_features)),
            ('top_norm1', nn.BatchNorm1d(top_dense_features)),
            ('top_relu1', nn.ReLU(inplace=True)),
            ('top_dropout1', nn.Dropout(p=top_drop_rate, inplace=True)),
            ('top_output', nn.Linear(top_dense_features, num_classes)),
        ]))

    def forward(self, x):
        # Flatten the incoming features to (batch_size, joined_num_features).
        x = x.view(x.shape[0], -1)
        x_out = self.top(x)

        return x_out
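
As a quick sanity check (my own addition, assuming the input flattens to 2352 features, e.g. 3 × 28 × 28 images), the forward pass can be exercised with a dummy batch:

import torch

# Hypothetical smoke test, not part of the original training runs.
model = BatchNormTest(num_classes=10)
dummy = torch.randn(4, 3, 28, 28)  # 4 images, 3 * 28 * 28 = 2352 features after flattening
out = model(dummy)
print(out.shape)  # torch.Size([4, 10])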

Training code:

# data.c is the number of classes, so pass it as num_classes
# (positionally it would land in top_dense_features).
model = BatchNormTest(num_classes=data.c)
learn = Learner(data, model, wd=0)
learn.fit(50, lr=1e-2)
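
For reference, here is roughly what that fit call boils down to in plain PyTorch (a sketch of my own, not the fastai internals; I am assuming a DataLoader called train_dl over the 32 images and Adam as the optimizer):

import torch
import torch.nn as nn

def overfit(model, train_dl, epochs=50, lr=1e-2, wd=0):
    # Minimal training loop: repeatedly fit the same 32 images.
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        for xb, yb in train_dl:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
    return loss.item()  # training loss on the last batch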

My results:

  • without nn.BatchNorm1d for 50 epochs: train loss 0.683759
  • with nn.BatchNorm1d for 50 epochs: train loss 3.11638

Is this behaviour expected from nn.BatchNorm1d?

Try it again with some weight decay. Batch norm has a weird interaction where the weights can grow large, which lowers your effective learning rate.
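
For example, something like this (the same Learner call you already have, just with a nonzero wd; the exact value is only a starting guess):

learn = Learner(data, model, wd=1e-2)  # nonzero weight decay instead of wd=0
learn.fit(50, lr=1e-2)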

Thank you very much for your reply, Karl!

I’ve tried running some experiments with different weight decay values, but the behaviour was the same: BatchNorm still performed worse.

However, one thing I did discover is that when I increase the batch size to 16 (it was 2 in the initial experiments), adding the BatchNorm layers actually leads to faster convergence. Could a batch size of 2 simply have been too small, and be the cause of the issue I encountered?

Yeah, that’s definitely it. Batch norm normalizes each feature using the current mini-batch’s mean and variance (and keeps running estimates of them for inference). With a batch size of 2 those statistics are extremely noisy, which leads to exactly the kind of problem you saw.
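
You can get a feel for how noisy the per-batch statistics are with a quick sketch like this (my own illustration on random features, not your data):

import torch

torch.manual_seed(0)
features = torch.randn(64, 512)  # stand-in for the activations feeding BatchNorm1d

for bs in (2, 16):
    # Per-feature mean computed batch by batch, as BatchNorm1d does in training mode.
    batch_means = torch.stack([b.mean(dim=0) for b in features.split(bs)])
    # Spread of those per-batch means across batches: much larger for bs=2.
    print(bs, batch_means.std(dim=0).mean().item())

The batch-size-2 means swing far more from batch to batch, so the layer renormalizes to a different target at every step.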
