Failed to recreate the DeViSe notebook architecture

fabiograetz · December 1, 2018, 2:30am

Hey guys,

I tried to recreate the model from the devise notebook from scratch. I think I recreated the model structure correctly; however, the loss remains far higher than in the notebook. I would be really happy if someone could help me figure out where my implementation differs from Jeremy’s. I am really curious now!

In the lecture notebook, the model is created like this:

model = ConvnetBuilder(arch, modeldata.c, is_multi=False, is_reg=True, xtra_fc=[1024], ps=[0.2, 0.2])

Those are the final layers of the architecture (after unfreezing so that we see everything).

(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (8): AdaptiveConcatPool2d(
    (ap): AdaptiveAvgPool2d(output_size=(1, 1))
    (mp): AdaptiveMaxPool2d(output_size=(1, 1))
  )
  (9): Flatten()
  (10): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (11): Dropout(p=0.2)
  (12): Linear(in_features=1024, out_features=1024, bias=True)
  (13): ReLU()
  (14): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (15): Dropout(p=0.2)
  (16): Linear(in_features=1024, out_features=300, bias=True)
)
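
For reference, here is a minimal sketch of what layers (8) and (9) in this printout do, written in plain PyTorch. The class name ConcatPoolSketch is hypothetical, and the concatenation order (average first, then max) is an assumption; the point is only to show why 512 backbone channels become the 1024 features that the following BatchNorm1d expects.

```python
import torch
import torch.nn as nn

class ConcatPoolSketch(nn.Module):
    """Hypothetical stand-in for the AdaptiveConcatPool2d + Flatten layers."""
    def __init__(self, output_size=1):
        super().__init__()
        self.ap = nn.AdaptiveAvgPool2d(output_size)
        self.mp = nn.AdaptiveMaxPool2d(output_size)

    def forward(self, x):
        # Concatenate along the channel axis; the order (avg, max) is an assumption.
        pooled = torch.cat([self.ap(x), self.mp(x)], dim=1)
        # Flatten (N, 2C, 1, 1) -> (N, 2C)
        return pooled.view(pooled.size(0), -1)

# A fake ResNet-34 feature map: batch of 2, 512 channels, 7x7 spatial grid.
features = torch.randn(2, 512, 7, 7)
pooled = ConcatPoolSketch()(features)
print(pooled.shape)  # torch.Size([2, 1024])
```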

I train on the reduced imagenet dataset (20%) and after 20 epochs the loss is down to:

epoch      trn_loss   val_loss
...
18     0.170165   0.215954                             
19     0.166288   0.21485

I tried to recreate this model the following way:

mod = torchvision.models.resnet34(pretrained=True)                     

class WordVecPredictor(nn.Module):
    def __init__(self, base, p = 0.1):
        self.name = "test"
        super().__init__()
        self.base = nn.Sequential(*list(mod2.children())[:-2])
        for param in self.base:
            param.requires_grad = False
        
        self.adaptAvgPool = nn.AdaptiveAvgPool2d(output_size=(1,1))
        self.adaptMaxPool = nn.AdaptiveMaxPool2d(output_size=(1,1))
        self.batchNorm = nn.BatchNorm1d(num_features=1024)
        self.batchNorm2 = nn.BatchNorm1d(num_features=1024)

        self.drop = nn.Dropout(p)
        self.lin1 = nn.Linear(in_features=1024, out_features=1024, bias=True)
        self.lin2 = nn.Linear(in_features=1024, out_features=300, bias=True)
        
    def forward(self, x):
        x = self.base(x)
        x = torch.cat((self.adaptAvgPool(x), self.adaptMaxPool(x)), dim=1)
        x = x.view(x.size(0), -1)
        x = self.batchNorm(x)
        x = self.drop(x)
        x = torch.nn.functional.relu(x)
        x = self.batchNorm2(x)
        x = self.drop(x)
        return self.lin2(x)

model = WordVecPredictor(mod)

Which in the summary looks like this (only the end again):

(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
  )
  (adaptAvgPool): AdaptiveAvgPool2d(output_size=(1, 1))
  (adaptMaxPool): AdaptiveMaxPool2d(output_size=(1, 1))
  (batchNorm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (batchNorm2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (drop): Dropout(p=0.1)
  (lin1): Linear(in_features=1024, out_features=1024, bias=True)
  (lin2): Linear(in_features=1024, out_features=300, bias=True)
)

I then created a Learner, used the lr finder to check that the same learning rate is still appropriate, and trained with the same settings.

However, the loss does not decrease below 0.5, as opposed to ~0.2 in the notebook, and I would really like to know why.

Does anyone have an idea?

Best regards from Berlin!

Fabio

MicPie · December 2, 2018, 9:03am

Hey Fabio,

when I compare the end of model #1:

  (9): Flatten()
  (10): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (11): Dropout(p=0.2)
  (12): Linear(in_features=1024, out_features=1024, bias=True)
  (13): ReLU()
  (14): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (15): Dropout(p=0.2)
  (16): Linear(in_features=1024, out_features=300, bias=True)

with the end of model #2:

  (batchNorm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (batchNorm2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (drop): Dropout(p=0.1)
  (lin1): Linear(in_features=1024, out_features=1024, bias=True)
  (lin2): Linear(in_features=1024, out_features=300, bias=True)

I see the following differences:

1. The sequential order is not the same, and the two models do not incorporate the same building blocks:
   model #1: BN, dropout (p = 0.2), linear, ReLU, BN, dropout (p = 0.2), linear
   model #2: BN, BN, dropout (p = 0.1 and not 0.2, as in model #1), linear, linear
2. No ReLU in model #2?
3. Different probabilities for dropout (see point 1: p of 0.2 vs. 0.1)

Did you print the models with the same approach? For model #1 you have the numbers included but for model #2 those are missing (and maybe due to that the order is mixed up?). When I print my custom models the numbering is shown (simple print with “model name” in a single jupyter cell).
You could also try the new learner.summary() function.

In your WordVecPredictor class you include torch.nn.functional.relu(x), but it does not show up in the printed model, which is strange. However, if there were a problem with this, I guess you would have run into an error message earlier and would not have been able to train the model at all.
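
A small sketch may help explain the two printouts, assuming standard PyTorch behavior: printing an nn.Module lists its submodules in the order they were assigned in __init__, not in the order forward calls them, and functional calls such as F.relu are not modules, so they never appear. TinyHead and its layer sizes are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyHead(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered in this order: bn1, bn2, drop, lin1, lin2
        self.bn1 = nn.BatchNorm1d(8)
        self.bn2 = nn.BatchNorm1d(8)
        self.drop = nn.Dropout(0.2)
        self.lin1 = nn.Linear(8, 8)
        self.lin2 = nn.Linear(8, 3)

    def forward(self, x):
        # Executed in a different order, with a functional ReLU in between.
        x = self.lin1(self.drop(self.bn1(x)))
        x = F.relu(x)
        return self.lin2(self.drop(self.bn2(x)))

head = TinyHead()
registered = [name for name, _ in head.named_children()]
print(registered)            # ['bn1', 'bn2', 'drop', 'lin1', 'lin2']
print("ReLU" in repr(head))  # False
```

So the printed order alone does not show whether two models compute the same thing; only the forward method does.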

On top of that, the dropout probabilities differ.

Maybe you already fixed your problem? I would be happy to hear about it, as I am currently trying to build some custom model myself.

Kind regards
Michael

fabiograetz · December 2, 2018, 10:25pm

Hey Michael,

thanks for looking at this with me!
I printed both models with learn.model, so I’m not sure why the output is different. Maybe because I did not use nn.Sequential? But in my opinion, the forward method of the class I implemented is the same as what the summary of the original model describes:

def forward(self, x):
        x = self.base(x)
        x = torch.cat((self.adaptAvgPool(x), self.adaptMaxPool(x)), dim=1)
        x = x.view(x.size(0), -1)
        x = self.batchNorm(x)
        x = self.drop(x)
        x = self.lin1(x)
        x = torch.nn.functional.relu(x)
        x = self.batchNorm2(x)
        x = self.drop(x)
        return self.lin2(x)

vs original model:

(8): AdaptiveConcatPool2d(
    (ap): AdaptiveAvgPool2d(output_size=(1, 1))
    (mp): AdaptiveMaxPool2d(output_size=(1, 1))
  )
  (9): Flatten()
  (10): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (11): Dropout(p=0.2)
  (12): Linear(in_features=1024, out_features=1024, bias=True)
  (13): ReLU()
  (14): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (15): Dropout(p=0.2)
  (16): Linear(in_features=1024, out_features=300, bias=True)
)

I accidentally left out one linear layer in the first post, but even with it included, the loss still does not decrease below 0.5.
The dropout of 0.1 was a leftover from some experiments with lowering dropout and weight decay, because I thought I might be underfitting due to too much regularization.

Are there any problems you can spot with my forward method?

Kind regards

Fabio

fabiograetz · December 11, 2018, 1:33pm

Ok guys, I figured this out.

Here is the solution to my problem:

self.base = nn.Sequential(*list(mod2.children())[:-2])
        for param in self.base:
            param.requires_grad = False

I intended to use these three lines to freeze the backbone. However, this apparently did not work as intended: the entire backbone was trained as well, which led to catastrophic forgetting and to the different training result compared to the lesson.
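
A likely reason, assuming standard PyTorch behavior: iterating over an nn.Sequential yields its child modules, not their parameters, so assigning requires_grad on each item only creates a plain attribute on the module and leaves the weights trainable. The sketch below contrasts that with freezing the parameters themselves; make_backbone is a small hypothetical stand-in for the cut ResNet-34 body.

```python
import torch.nn as nn

def make_backbone():
    # Small hypothetical stand-in for the cut ResNet-34 body.
    return nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# Pattern from the post: iterates over child modules, not parameters.
backbone = make_backbone()
for layer in backbone:
    layer.requires_grad = False  # only creates a plain attribute on the module
still_trainable = sum(p.requires_grad for p in backbone.parameters())

# Freezing the parameters themselves.
frozen = make_backbone()
for param in frozen.parameters():
    param.requires_grad = False
trainable_after_freeze = sum(p.requires_grad for p in frozen.parameters())

print(still_trainable, trainable_after_freeze)  # 4 0
```

With the parameters frozen this way, only those that still have requires_grad=True need to be handed to the optimizer, for example via filter(lambda p: p.requires_grad, model.parameters()).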

The following approach, which builds the model similarly to what is shown in lesson 14 (segmentation with u-net), works:

cut, lr_cut = model_meta[backbone]

def get_base():
    layers = cut_model(backbone(True), cut)
    return nn.Sequential(*layers)

model_base = get_base()

class WordVecPredictorNew(nn.Module):
    def __init__(self, backbone, p=0.2):
        super().__init__()
        self.backbone = backbone
        self.features = nn.Sequential(
            backbone,
            AdaptiveConcatPool2d(1),
            Flatten(),
            nn.BatchNorm1d(1024),
            nn.Dropout(p),
            nn.Linear(in_features=1024, out_features=1024, bias=True),
            nn.ReLU(),
            nn.BatchNorm1d(1024),
            nn.Dropout(p),
            nn.Linear(in_features=1024, out_features=300, bias=True)
        )

    def forward(self, x):
        return self.features(x)

class WordVecPredictorModel():
    def __init__(self, model, name="wordvec_predictor"):
        self.model, self.name = model, name
        
    def get_layer_groups(self, precompute):
        layer_groups = list(split_by_idxs(list(self.model.backbone.children()), [lr_cut]))
        
        return layer_groups + [list(self.model.features.children())[1:]]

With this, I get training behavior very similar to that of the model built using the ConvLearner that Jeremy used during class.

Hope this helps someone.
Best regards
Fabio