Hello everyone, great to meet you!
I have been writing a custom implementation of DenseNet by adding a few layers after the final Dense Block. As suggested by Andrej Karpathy, I am testing the changes by overfitting the network on a small number of training examples (32 images).
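For reference, this is roughly what I mean by the sanity check, sketched on synthetic data (the random tensors and the tiny MLP below are stand-ins, not my actual pipeline):

```python
import torch
import torch.nn as nn

# Stand-ins: 32 random "images" and a small MLP, purely for illustration
torch.manual_seed(0)
images = torch.randn(32, 2352)         # 32 inputs, already flattened
labels = torch.randint(0, 10, (32,))   # fake labels for 10 classes

model = nn.Sequential(nn.Linear(2352, 512), nn.ReLU(), nn.Linear(512, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

print(loss.item())  # should approach zero if the network can memorise the batch
```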
Whilst doing this, I came across some strange behaviour: if I include two nn.BatchNorm1d layers in the top of my network, it becomes much harder to overfit on the 32 images.
The code for the network:
```python
from collections import OrderedDict

import torch.nn as nn


class BatchNormTest(nn.Module):
    def __init__(self, top_dense_features=512, top_drop_rate=0, num_classes=1000):
        super().__init__()
        # Top
        joined_num_features = 2352
        self.top = nn.Sequential(OrderedDict([
            ('top_dense0', nn.Linear(joined_num_features, top_dense_features)),
            ('top_norm0', nn.BatchNorm1d(top_dense_features)),
            ('top_relu0', nn.ReLU(inplace=True)),
            ('top_dropout0', nn.Dropout(p=top_drop_rate, inplace=True)),
            ('top_dense1', nn.Linear(top_dense_features, top_dense_features)),
            ('top_norm1', nn.BatchNorm1d(top_dense_features)),
            ('top_relu1', nn.ReLU(inplace=True)),
            ('top_dropout1', nn.Dropout(p=top_drop_rate, inplace=True)),
            ('top_output', nn.Linear(top_dense_features, num_classes)),
        ]))

    def forward(self, x):
        # Flatten all dimensions except the batch dimension
        x = x.view(x.shape[0], -1)
        x_out = self.top(x)
        return x_out
```
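As a quick sanity check on shapes (the batch size and class count here are arbitrary, not from my experiment), I would expect this forward pass to work:

```python
import torch

model = BatchNormTest(num_classes=10)
x = torch.randn(32, 2352)   # any batch that flattens to joined_num_features
out = model(x)
print(out.shape)            # torch.Size([32, 10])
```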
```python
model = BatchNormTest(num_classes=data.c)  # data.c is the number of classes (fastai)
learn = Learner(data, model, wd=0)
learn.fit(50, lr=1e-2)
```
- without the nn.BatchNorm1d layers, after 50 epochs: train loss 0.683759
- with the nn.BatchNorm1d layers, after 50 epochs: train loss 3.11638
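I am not sure whether this explains the gap, but one property of BatchNorm that seems relevant is that in train mode it normalises with the current batch's statistics, while in eval mode it uses the running statistics, so the loss on the very same 32 images can differ between the two modes. A sketch of how I would probe that, again on synthetic inputs:

```python
import torch
import torch.nn as nn

# Synthetic batch purely to probe the two BatchNorm modes
torch.manual_seed(0)
x = torch.randn(32, 2352)
y = torch.randint(0, 10, (32,))

model = BatchNormTest(num_classes=10)
criterion = nn.CrossEntropyLoss()

model.train()                   # BatchNorm uses the batch's own statistics
with torch.no_grad():
    train_mode_loss = criterion(model(x), y)

model.eval()                    # BatchNorm uses its running statistics
with torch.no_grad():
    eval_mode_loss = criterion(model(x), y)

print(train_mode_loss.item(), eval_mode_loss.item())
```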
Is this behaviour expected from nn.BatchNorm1d?