Slower convergence when adding BatchNorm1d layer

Hello everyone, great to meet you all!

I have been trying to write a custom DenseNet implementation that adds a few layers after the final dense block. As Andrej Karpathy suggests, I am testing the changes by overfitting the network on a small number of training examples (32 images).

Whilst doing this I came across some strange behaviour: if I include two nn.BatchNorm1d layers at the top of my network, it becomes much harder to overfit on those 32 images.

The code for the network:

import torch.nn as nn
from collections import OrderedDict


class BatchNormTest(nn.Module):
    def __init__(self, top_dense_features=512, top_drop_rate=0, num_classes=1000):
        super().__init__()

        # Top of the network: two Linear -> BatchNorm1d -> ReLU -> Dropout blocks,
        # followed by the classification layer.
        joined_num_features = 2352
        self.top = nn.Sequential(OrderedDict([
            ('top_dense0', nn.Linear(joined_num_features, top_dense_features)),
            ('top_norm0', nn.BatchNorm1d(top_dense_features)),
            ('top_relu0', nn.ReLU(inplace=True)),
            ('top_dropout0', nn.Dropout(p=top_drop_rate, inplace=True)),
            ('top_dense1', nn.Linear(top_dense_features, top_dense_features)),
            ('top_norm1', nn.BatchNorm1d(top_dense_features)),
            ('top_relu1', nn.ReLU(inplace=True)),
            ('top_dropout1', nn.Dropout(p=top_drop_rate, inplace=True)),
            ('top_output', nn.Linear(top_dense_features, num_classes)),
        ]))

    def forward(self, x):
        # Flatten the incoming features to (batch_size, joined_num_features).
        x = x.view(x.shape[0], -1)
        x_out = self.top(x)

        return x_out
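
As a quick sanity check (my own addition, assuming the input flattens to 2352 features, e.g. 3 × 28 × 28 images), the forward pass can be exercised with a dummy batch:

import torch

# Hypothetical smoke test, not part of the original training runs.
model = BatchNormTest(num_classes=10)
dummy = torch.randn(4, 3, 28, 28)  # 4 images, 3 * 28 * 28 = 2352 features after flattening
out = model(dummy)
print(out.shape)  # torch.Size([4, 10])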

Training code:

# data.c is the number of classes, so pass it as num_classes
# (positionally it would land in top_dense_features).
model = BatchNormTest(num_classes=data.c)
learn = Learner(data, model, wd=0)
learn.fit(50, lr=1e-2)
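
For reference, here is roughly what that fit call boils down to in plain PyTorch (a sketch of my own, not the fastai internals; I am assuming a DataLoader called train_dl over the 32 images and Adam as the optimizer):

import torch
import torch.nn as nn

def overfit(model, train_dl, epochs=50, lr=1e-2, wd=0):
    # Minimal training loop: repeatedly fit the same 32 images.
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        for xb, yb in train_dl:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
    return loss.item()  # training loss on the last batch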

My results:

  • without nn.BatchNorm1d for 50 epochs: train loss 0.683759
  • with nn.BatchNorm1d for 50 epochs: train loss 3.11638

Is this behaviour expected from nn.BatchNorm1d?

Try it again with some weight decay. Batch norm has a weird interaction where the weights can grow large, which lowers your effective learning rate.
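
For example, something like this (the same Learner call you already have, just with a nonzero wd; the exact value is only a starting guess):

learn = Learner(data, model, wd=1e-2)  # nonzero weight decay instead of wd=0
learn.fit(50, lr=1e-2)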

Thank you very much for your reply, Karl!

I’ve tried running some experiments with different weight decay values, but the behaviour was the same: BatchNorm still performed worse.

However, one thing I did discover is that when I increase the batch size to 16 (it was 2 in the initial experiments), adding the BatchNorm layers actually leads to faster convergence. Could a batch size of 2 simply have been too small, and be the cause of the issue I encountered?

Yeah, that’s definitely it. Batch norm normalizes each feature using the current mini-batch’s mean and variance (and keeps running estimates of them for inference). With a batch size of 2 those statistics are extremely noisy, which leads to exactly the kind of problem you saw.
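
You can get a feel for how noisy the per-batch statistics are with a quick sketch like this (my own illustration on random features, not your data):

import torch

torch.manual_seed(0)
features = torch.randn(64, 512)  # stand-in for the activations feeding BatchNorm1d

for bs in (2, 16):
    # Per-feature mean computed batch by batch, as BatchNorm1d does in training mode.
    batch_means = torch.stack([b.mean(dim=0) for b in features.split(bs)])
    # Spread of those per-batch means across batches: much larger for bs=2.
    print(bs, batch_means.std(dim=0).mean().item())

The batch-size-2 means swing far more from batch to batch, so the layer renormalizes to a different target at every step.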
