Hey guys,

I tried to recreate the model from the *devise* notebook from scratch. I think I successfully recreated the model structure; however, the loss remains far higher than in the notebook. I would be really happy if someone could help me figure out where my implementation differs from Jeremy's, as I have gotten really curious now!

In the lecture notebook, the model is created like this:

`model = ConvnetBuilder(arch, modeldata.c, is_multi=False, is_reg=True, xtra_fc=[1024], ps=[0.2, 0.2])`

These are the final layers of the architecture (after unfreezing, so that everything is shown):

```
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(8): AdaptiveConcatPool2d(
(ap): AdaptiveAvgPool2d(output_size=(1, 1))
(mp): AdaptiveMaxPool2d(output_size=(1, 1))
)
(9): Flatten()
(10): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): Dropout(p=0.2)
(12): Linear(in_features=1024, out_features=1024, bias=True)
(13): ReLU()
(14): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(15): Dropout(p=0.2)
(16): Linear(in_features=1024, out_features=300, bias=True)
)
```
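For reference, the printed head above should correspond to something like the following plain-PyTorch reconstruction (my own sketch, not code from the notebook). The `AdaptiveConcatPool2d` is just avg- and max-pooling concatenated along the channel dimension, which is why the 512 backbone features become 1024:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveConcatPool2d(nn.Module):
    """Concatenate adaptive avg- and max-pooling; doubles the channels."""
    def forward(self, x):
        return torch.cat([F.adaptive_avg_pool2d(x, 1),
                          F.adaptive_max_pool2d(x, 1)], dim=1)

# Layers (8)-(16) from the summary above, written out by hand.
head = nn.Sequential(
    AdaptiveConcatPool2d(),   # (N, 512, H, W) -> (N, 1024, 1, 1)
    nn.Flatten(),             # -> (N, 1024)
    nn.BatchNorm1d(1024),
    nn.Dropout(0.2),
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.BatchNorm1d(1024),
    nn.Dropout(0.2),
    nn.Linear(1024, 300),     # 300-dim word vector output
)
```

A quick shape check: feeding a `(2, 512, 7, 7)` tensor through `head` should produce a `(2, 300)` output.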

I train on the reduced ImageNet dataset (20%), and after 20 epochs the loss is down to:

```
epoch   trn_loss   val_loss
...
18      0.170165   0.215954
19      0.166288   0.21485
```

I tried to recreate this model in the following way:

```
mod = torchvision.models.resnet34(pretrained=True)

class WordVecPredictor(nn.Module):
    def __init__(self, base, p=0.1):
        super().__init__()
        self.name = "test"
        # backbone: everything except the original avgpool and fc layers
        self.base = nn.Sequential(*list(base.children())[:-2])
        # freeze the backbone parameters
        for param in self.base.parameters():
            param.requires_grad = False
        self.adaptAvgPool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
        self.adaptMaxPool = nn.AdaptiveMaxPool2d(output_size=(1, 1))
        self.batchNorm = nn.BatchNorm1d(num_features=1024)
        self.batchNorm2 = nn.BatchNorm1d(num_features=1024)
        self.drop = nn.Dropout(p)
        self.lin1 = nn.Linear(in_features=1024, out_features=1024, bias=True)
        self.lin2 = nn.Linear(in_features=1024, out_features=300, bias=True)

    def forward(self, x):
        x = self.base(x)
        x = torch.cat((self.adaptAvgPool(x), self.adaptMaxPool(x)), dim=1)
        x = x.view(x.size(0), -1)
        x = self.batchNorm(x)
        x = self.drop(x)
        x = torch.nn.functional.relu(x)
        x = self.batchNorm2(x)
        x = self.drop(x)
        return self.lin2(x)

model = WordVecPredictor(mod)
```
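One debugging helper I can share for this kind of structure comparison (my own sketch, not something from the fastai library): register a forward hook on every leaf module and record which ones actually fire during a forward pass. Comparing that list against `model.named_modules()` shows whether any layer is defined in `__init__` but never touched in `forward`:

```python
import torch
import torch.nn as nn

def trace_forward(model, example_input):
    """Return the names of leaf modules that actually run in forward()."""
    fired, hooks = [], []
    for name, module in model.named_modules():
        if not list(module.children()):  # hook leaf modules only
            hooks.append(module.register_forward_hook(
                lambda m, inp, out, name=name: fired.append(name)))
    model.eval()
    with torch.no_grad():
        model(example_input)
    for h in hooks:
        h.remove()
    return fired
```

Any name that appears in `dict(model.named_modules())` but never shows up in the returned list is a layer whose parameters are allocated but which the forward pass skips.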

In the summary, it looks like this (again only the end):

```
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(adaptAvgPool): AdaptiveAvgPool2d(output_size=(1, 1))
(adaptMaxPool): AdaptiveMaxPool2d(output_size=(1, 1))
(batchNorm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(batchNorm2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(drop): Dropout(p=0.1)
(lin1): Linear(in_features=1024, out_features=1024, bias=True)
(lin2): Linear(in_features=1024, out_features=300, bias=True)
)
```

I then created a Learner, made sure with the LR finder that the learning rate is still appropriate, and trained with the same settings.

**However, the loss does not decrease below 0.5 (as opposed to ~0.2), and I would really like to know why.**

Does anyone have an idea?

Best regards from Berlin!

Fabio