Hey guys,

I tried to recreate the model from the *devise* notebook from scratch. I think I successfully recreated the model structure; however, the loss remains far higher than in the notebook. I would be really happy if someone could help me figure out where my implementation differs from Jeremy's, as I have gotten really curious now!

In the lecture notebook, the model is created like this:

`model = ConvnetBuilder(arch, modeldata.c, is_multi=False, is_reg=True, xtra_fc=[1024], ps=[0.2, 0.2])`

These are the final layers of the architecture (after unfreezing, so that everything is shown):

```
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(8): AdaptiveConcatPool2d(
(ap): AdaptiveAvgPool2d(output_size=(1, 1))
(mp): AdaptiveMaxPool2d(output_size=(1, 1))
)
(9): Flatten()
(10): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): Dropout(p=0.2)
(12): Linear(in_features=1024, out_features=1024, bias=True)
(13): ReLU()
(14): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(15): Dropout(p=0.2)
(16): Linear(in_features=1024, out_features=300, bias=True)
)
```
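For reference, the printed head above should correspond to something like the following plain-PyTorch reconstruction (my own sketch, not code from the notebook). The `AdaptiveConcatPool2d` is just avg- and max-pooling concatenated along the channel dimension, which is why the 512 backbone features become 1024:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveConcatPool2d(nn.Module):
    """Concatenate adaptive avg- and max-pooling; doubles the channels."""
    def forward(self, x):
        return torch.cat([F.adaptive_avg_pool2d(x, 1),
                          F.adaptive_max_pool2d(x, 1)], dim=1)

# Layers (8)-(16) from the summary above, written out by hand.
head = nn.Sequential(
    AdaptiveConcatPool2d(),   # (N, 512, H, W) -> (N, 1024, 1, 1)
    nn.Flatten(),             # -> (N, 1024)
    nn.BatchNorm1d(1024),
    nn.Dropout(0.2),
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.BatchNorm1d(1024),
    nn.Dropout(0.2),
    nn.Linear(1024, 300),     # 300-dim word vector output
)
```

A quick shape check: feeding a `(2, 512, 7, 7)` tensor through `head` should produce a `(2, 300)` output.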

I train on the reduced ImageNet dataset (20%), and after 20 epochs the loss is down to:

```
epoch   trn_loss   val_loss
...
18      0.170165   0.215954
19      0.166288   0.21485
```

I tried to recreate this model in the following way:

```
mod = torchvision.models.resnet34(pretrained=True)

class WordVecPredictor(nn.Module):
    def __init__(self, base, p=0.1):
        super().__init__()
        self.name = "test"
        # backbone: everything except the original avgpool and fc layers
        self.base = nn.Sequential(*list(base.children())[:-2])
        # freeze the backbone parameters
        for param in self.base.parameters():
            param.requires_grad = False
        self.adaptAvgPool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
        self.adaptMaxPool = nn.AdaptiveMaxPool2d(output_size=(1, 1))
        self.batchNorm = nn.BatchNorm1d(num_features=1024)
        self.batchNorm2 = nn.BatchNorm1d(num_features=1024)
        self.drop = nn.Dropout(p)
        self.lin1 = nn.Linear(in_features=1024, out_features=1024, bias=True)
        self.lin2 = nn.Linear(in_features=1024, out_features=300, bias=True)

    def forward(self, x):
        x = self.base(x)
        x = torch.cat((self.adaptAvgPool(x), self.adaptMaxPool(x)), dim=1)
        x = x.view(x.size(0), -1)
        x = self.batchNorm(x)
        x = self.drop(x)
        x = torch.nn.functional.relu(x)
        x = self.batchNorm2(x)
        x = self.drop(x)
        return self.lin2(x)

model = WordVecPredictor(mod)
```
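One debugging helper I can share for this kind of structure comparison (my own sketch, not something from the fastai library): register a forward hook on every leaf module and record which ones actually fire during a forward pass. Comparing that list against `model.named_modules()` shows whether any layer is defined in `__init__` but never touched in `forward`:

```python
import torch
import torch.nn as nn

def trace_forward(model, example_input):
    """Return the names of leaf modules that actually run in forward()."""
    fired, hooks = [], []
    for name, module in model.named_modules():
        if not list(module.children()):  # hook leaf modules only
            hooks.append(module.register_forward_hook(
                lambda m, inp, out, name=name: fired.append(name)))
    model.eval()
    with torch.no_grad():
        model(example_input)
    for h in hooks:
        h.remove()
    return fired
```

Any name that appears in `dict(model.named_modules())` but never shows up in the returned list is a layer whose parameters are allocated but which the forward pass skips.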

In the summary, it looks like this (again only the end):

```
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(adaptAvgPool): AdaptiveAvgPool2d(output_size=(1, 1))
(adaptMaxPool): AdaptiveMaxPool2d(output_size=(1, 1))
(batchNorm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(batchNorm2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(drop): Dropout(p=0.1)
(lin1): Linear(in_features=1024, out_features=1024, bias=True)
(lin2): Linear(in_features=1024, out_features=300, bias=True)
)
```

I then created a Learner, made sure with the LR finder that the learning rate is still appropriate, and trained with the same settings.

**However, the loss does not decrease below 0.5 (as opposed to ~0.2), and I would really like to know why.**

Does anyone have an idea?

Best regards from Berlin!

Fabio