Hey all, sorry for yet another question. I’m debugging my implementation of NTS-Net so that it works through something like `cnn_learner`. Here’s the gist: after going through the source code for creating the body and head of a model, I’ve done the following:

```
net = attention_net()
learn = Learner(data, net, loss_func=total_loss, metrics=metric)
new_state_dict = torch.load(Path('model.ckpt'))['net_state_dict'] ### I can load it into learner
```

The above loads their pre-trained model. I could not just grab the ResNet weights, because I need more than that, specifically a few layers in the head.

```
learn_state_dict = learn.model.state_dict()
for name, param in learn_state_dict.items():
    if name in new_state_dict:
        input_param = new_state_dict[name]
        if input_param.shape == param.shape:
            param.copy_(input_param)
        else:
            print('Shape mismatch')
learn.model.load_state_dict(learn_state_dict)
```
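
For comparison, the same shape-checked copy can be sketched in plain PyTorch by filtering the checkpoint first and then calling `load_state_dict(strict=False)`. This is only a minimal sketch on toy modules: `filter_matching` and the two small `nn.Sequential` models are made up for illustration and are not part of NTS-Net or fastai.

```python
import torch
import torch.nn as nn

def filter_matching(target: nn.Module, source_state: dict) -> dict:
    """Keep only checkpoint entries whose name and shape match the target model."""
    target_state = target.state_dict()
    return {
        name: tensor
        for name, tensor in source_state.items()
        if name in target_state and tensor.shape == target_state[name].shape
    }

# Toy stand-ins (hypothetical): same first layer, different final layer size.
torch.manual_seed(0)
pretrained = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 200))
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 10))

matched = filter_matching(model, pretrained.state_dict())
# strict=False lets the mismatched final layer keep its fresh initialisation.
result = model.load_state_dict(matched, strict=False)

print(sorted(matched))       # names that were copied
print(result.missing_keys)   # names left at their initial values
```

The `missing_keys` list is a quick way to see which layers did not receive pre-trained weights.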

That copies the matching weights into my model. From here I can train, etc., using their weights. Since the layers now hold the original weights, I wanted to split the model from here.

```
def nts_body(m:nn.Module):
    # Keep the first 8 children of the ResNet backbone as the body
    return nn.Sequential(*list(m.pretrained_model.children())[:8])
body = nts_body(learn.model)
```

```
h1 = list(learn.model.pretrained_model.children())[8:]
h1[1] = nn.Linear(2048, data.c)
cn = nn.Linear(10240, data.c)
prt = nn.Linear(2048, data.c)
prp = list(learn.model.proposal_net.children())
```

Head generation is a bit disgusting at the moment:

```
head = nn.Sequential(*list(learn.model.pretrained_model.children())[8:],
                     *list(learn.model.proposal_net.children()),
                     cn, prt)
```

But you can essentially see what I do here. Then I load it into a sequential,

`model = nn.Sequential(body, head)`

Make a learner:

```
learn = Learner(data, model, loss_func=mytotal_loss, metrics=metric)
```

Then split at the point where we usually cut off the last two layers of the ResNet for transfer learning:

```
def _nts_cut(m:nn.Module)->List[nn.Module]:
    return (m[0], m[1])
learn.split(split_on=_nts_cut);
```
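
For anyone less familiar with layer groups, here is roughly what I am trying to get out of the split, sketched in plain PyTorch rather than fastai. The toy `body` and `head` and the learning rates are made up for illustration, and this is not how fastai implements it internally: each group gets its own optimizer parameter group, and the body can be frozen separately.

```python
import torch
import torch.nn as nn

# Toy stand-ins (hypothetical) for the real body and head.
body = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
head = nn.Sequential(nn.Linear(8, 5))
model = nn.Sequential(body, head)

# One optimizer parameter group per layer group, each with its own learning rate.
optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 1e-4},  # body: small steps
        {"params": model[1].parameters(), "lr": 1e-2},  # head: larger steps
    ],
    lr=1e-3,  # default, overridden by the per-group values above
    momentum=0.9,
)

# Freezing the body amounts to turning off gradients for its parameters.
for p in model[0].parameters():
    p.requires_grad_(False)

group_lrs = [g["lr"] for g in optimizer.param_groups]
print(group_lrs)
```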

Here is a small gist of the end of the model:

```
      )
      (2): Bottleneck(
        (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
      )
    )
  )
  (1): Sequential(
    (0): AdaptiveAvgPool2d(output_size=1)
    (1): Linear(in_features=2048, out_features=200, bias=True)
    (2): Conv2d(2048, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (4): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (5): ReLU()
    (6): Conv2d(128, 6, kernel_size=(1, 1), stride=(1, 1))
    (7): Conv2d(128, 6, kernel_size=(1, 1), stride=(1, 1))
    (8): Conv2d(128, 9, kernel_size=(1, 1), stride=(1, 1))
    (9): Linear(in_features=10240, out_features=200, bias=True)
    (10): Linear(in_features=2048, out_features=200, bias=True)
  )
)
```

When I view the layer groups, everything looks in place, but when I run `lr_find()` it reports a size mismatch, `m1: [8192 x 1], m2: [2048 x 200]`, on the linear layer after that AdaptiveAvgPool. Why is that? This only happens when I try to split those layer groups; the standard model (not set up for `fit_one_cycle`) works fine. And I cannot just flatten this, as I am keeping track of 4 different computations the entire time.
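
For reference, the shapes in that message can be reproduced in isolation. They are consistent with the pooled `[batch, 2048, 1, 1]` tensor reaching the `Linear` layer directly, which is what happens when an `nn.Sequential` simply chains the layers in order. This is a minimal sketch, assuming a batch size of 4 and a 7x7 feature map; both are guesses on my part, not values taken from the actual run.

```python
import torch
import torch.nn as nn

batch_size = 4  # assumed; 4 * 2048 = 8192 matches the first dimension in the error
features = torch.randn(batch_size, 2048, 7, 7)  # assumed feature map size

pool = nn.AdaptiveAvgPool2d(1)
fc = nn.Linear(2048, 200)

pooled = pool(features)
print(pooled.shape)  # torch.Size([4, 2048, 1, 1])

# nn.Sequential would hand `pooled` straight to the Linear layer, which
# multiplies along the last dimension (size 1 here, not 2048).
try:
    fc(pooled)
    failed = False
except RuntimeError as err:
    failed = True
    print(err)

# With the trailing 1x1 dimensions collapsed, the same layer accepts the input.
logits = fc(pooled.view(batch_size, -1))
print(logits.shape)  # torch.Size([4, 200])
```

That does not solve my problem, since the real model needs four separate outputs rather than one straight pass through the head, but it shows where the `8192 x 1` comes from.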

Thank you for any help,

Zach