precompute=True

wgpubs · November 3, 2017, 6:47pm

Am I right to assume that, when using resnet34, precompute=True tells the learner to precompute the activations from all BUT the last layer in resnet34?

If so, is there a way to tell it to precompute activations from all but a specified layer in the model? For example, to precompute from only the convolutional layers?

yinterian · November 3, 2017, 6:52pm

There is: look at the xtra_cut argument of ConvnetBuilder (https://github.com/fastai/fastai/blob/master/fastai/conv_learner.py).

wgpubs · November 3, 2017, 7:15pm

In the code, I see that the layer manipulation uses the “model_meta” dictionary, but what do the values in the model_meta dictionary mean?

I see that resnet34 = [8,6]

Does that mean, “by default, include all layers up to, but not including, layer 8”? Given the torchvision source code here, how did you determine that the right cut-off target was 8?

And what does the “6” refer to? I see it is something called “lr_cut”, but I’m not sure how to interpret it.

Thanks!

sermakarevich · November 3, 2017, 7:30pm

cut,self.lr_cut = self.model_meta[self.f]
cut-=xtra_cut
layers = cut_model(self.f(True), cut)

and
def cut_model(m, cut): return list(m.children())[:cut]

So yes! The model is kept up to the first value in the list: resnet18: [8,6] means a cut at 8 in the case of resnet18.
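To make the slicing concrete, here is a minimal sketch that uses plain strings in place of real modules. The child names follow the attribute names of torchvision's ResNet, the `[8, 6]` entry is the value quoted above, and `cut_layers` is a hypothetical helper that only mirrors the `cut -= xtra_cut` and `[:cut]` logic; it is not fastai's function.

```python
# Top-level children of a torchvision ResNet, in order.
# Strings stand in for the actual nn.Module objects.
RESNET_CHILDREN = ["conv1", "bn1", "relu", "maxpool",
                   "layer1", "layer2", "layer3", "layer4",
                   "avgpool", "fc"]

# Value quoted in the thread: [cut, lr_cut].
MODEL_META = {"resnet34": [8, 6]}


def cut_layers(children, arch, xtra_cut=0):
    """Hypothetical helper mirroring `cut -= xtra_cut; children[:cut]`."""
    cut, _lr_cut = MODEL_META[arch]
    cut -= xtra_cut
    return children[:cut]


kept_default = cut_layers(RESNET_CHILDREN, "resnet34")
kept_shallower = cut_layers(RESNET_CHILDREN, "resnet34", xtra_cut=1)

print(kept_default[-1])    # layer4
print(kept_shallower[-1])  # layer3
```

With the default cut of 8, the average pooling and the final linear layer (indices 8 and 9) are dropped; each extra unit of `xtra_cut` removes one more top-level child from the end.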

wgpubs · November 3, 2017, 7:46pm

Cool.

So it looks like it is grabbing everything except the average-pooling layer and the final FC layer:

from torchvision.models import resnet34
r = resnet34()
list(r.children())[:8]

output:

[Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False),
 BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True),
 ReLU (inplace),
 MaxPool2d (size=(3, 3), stride=(2, 2), padding=(1, 1), dilation=(1, 1)),
 Sequential (
   (0): BasicBlock (
     (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
   )
   (1): BasicBlock (
     (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
   )
   (2): BasicBlock (
     (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
   )
 ),
 Sequential (
   (0): BasicBlock (
     (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
     (downsample): Sequential (
       (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
       (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
     )
   )
   (1): BasicBlock (
     (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
   )
   (2): BasicBlock (
     (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
   )
   (3): BasicBlock (
     (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
   )
 ),
 Sequential (
   (0): BasicBlock (
     (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     (downsample): Sequential (
       (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
       (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     )
   )
   (1): BasicBlock (
     (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
   )
   (2): BasicBlock (
     (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
   )
   (3): BasicBlock (
     (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
   )
   (4): BasicBlock (
     (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
   )
   (5): BasicBlock (
     (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
   )
 ),
 Sequential (
   (0): BasicBlock (
     (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
     (downsample): Sequential (
       (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
       (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
     )
   )
   (1): BasicBlock (
     (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
   )
   (2): BasicBlock (
     (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
   )
 )]

It then looks like it is appending an AdaptiveConcatPool2d() layer + a new FC layer (output = # of classes).

sermakarevich · November 3, 2017, 7:52pm

Exactly: AdaptiveConcatPool2d and Flatten.

layers += [AdaptiveConcatPool2d(), Flatten()]

We can control the depth of the cut with the xtra_cut parameter:

cut-=xtra_cut
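For readers wondering what AdaptiveConcatPool2d followed by Flatten produces, here is a rough pure-Python sketch. It assumes the layer pools each channel down to a single value with both max and average pooling and concatenates the two results, which doubles the number of features; the ordering (max first, then average) and the helper name `adaptive_concat_pool` are assumptions for illustration, not taken from the fastai source.

```python
def adaptive_concat_pool(feature_maps):
    """Pool each channel to one max and one mean value, then concatenate.

    feature_maps: list of channels, each channel a 2D list of numbers.
    Returns a flat list of length 2 * number_of_channels.
    """
    maxes = [max(max(row) for row in channel) for channel in feature_maps]
    means = [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
             for channel in feature_maps]
    return maxes + means


# Two toy 2x2 channels.
channels = [
    [[1, 2], [3, 4]],
    [[0, 0], [0, 8]],
]
pooled = adaptive_concat_pool(channels)
print(pooled)  # [4, 8, 2.5, 2.0]
```

Under this assumption, the 512 channels coming out of the cut resnet34 would become 1024 features going into the new fully connected layers.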

anandsaha · November 4, 2017, 1:50am

Lost here. What is it precomputing?

I understand using pre-trained weights and freezing layers while training, but I am not able to understand precomputing activations.

sanjeev.b · November 4, 2017, 2:07am

You just need to take your thought one step further: with pre-trained weights and frozen layers, the output of those layers won’t change from epoch to epoch as you go through the inputs. So you can precompute the output of those layers once. This makes training the later layers, whose weights are changing, MUCH faster.
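The idea above can be sketched with a toy model. This is only an illustration under simplifying assumptions: `frozen_body` is a hypothetical stand-in for the frozen pre-trained layers, and the "last layer" is a single trainable weight. It is not how fastai implements precompute.

```python
calls = {"body": 0}


def frozen_body(x):
    """Stand-in for the frozen layers: fixed weights, so a pure function of x."""
    calls["body"] += 1
    return [2.0 * v for v in x]


def train_head(inputs, targets, epochs, precompute):
    """Train a one-weight 'last layer' on top of the frozen body."""
    w = 0.0
    lr = 0.01
    # With precompute, the body runs once per input, before training starts.
    cache = [frozen_body(x) for x in inputs] if precompute else None
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(inputs, targets)):
            feats = cache[i] if precompute else frozen_body(x)
            s = sum(feats)
            pred = w * s
            w -= lr * 2 * (pred - y) * s  # gradient step on squared error
    return w


inputs = [[1.0, 0.5], [0.2, 0.1], [0.3, 0.3]]
targets = [1.0, 0.0, 0.5]

calls["body"] = 0
w_cached = train_head(inputs, targets, epochs=5, precompute=True)
calls_cached = calls["body"]

calls["body"] = 0
w_live = train_head(inputs, targets, epochs=5, precompute=False)
calls_live = calls["body"]

print(calls_cached, calls_live)  # 3 15
print(w_cached == w_live)        # True
```

Both runs end with the same head weight, but the cached run calls the expensive body once per input instead of once per input per epoch.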

anandsaha · November 4, 2017, 2:09am

Aah got it! Thanks

yinterian · November 4, 2017, 9:15am

In the next few lectures Jeremy will explain the concept of differential learning rates. The basic idea is that you use different learning rates for different layers. Layers closer to the input should get lower learning rates, and layers closer to the output higher ones. “model_meta” is used to define 3 groups of layers for the purpose of using differential learning rates.
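As a rough sketch of how the second model_meta value could produce three groups: split the kept body layers at `lr_cut`, then treat the newly added layers as the third group. The exact split fastai performs, the helper name `split_into_groups`, the head layer names, and the learning-rate values below are all assumptions for illustration.

```python
def split_into_groups(body_layers, lr_cut, head_layers):
    """Hypothetical split: early body, late body, and the new head."""
    return [body_layers[:lr_cut], body_layers[lr_cut:], head_layers]


# Body of resnet34 after the default cut at 8 (strings stand in for modules).
body = ["conv1", "bn1", "relu", "maxpool",
        "layer1", "layer2", "layer3", "layer4"]
head = ["fc_block1", "fc_block2"]  # placeholder names for the added layers

groups = split_into_groups(body, lr_cut=6, head_layers=head)

# Illustrative values only: lower near the input, higher near the output.
lrs = [1e-4, 1e-3, 1e-2]
per_layer_lr = {layer: lr for group, lr in zip(groups, lrs) for layer in group}

print(per_layer_lr["conv1"])      # 0.0001
print(per_layer_lr["layer4"])     # 0.001
print(per_layer_lr["fc_block2"])  # 0.01
```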

anandsaha · November 4, 2017, 5:12pm

Hello @jeremy

I see that we are chopping off the last two layers of resnet34:

<class 'torch.nn.modules.conv.Conv2d'>
<class 'torch.nn.modules.batchnorm.BatchNorm2d'>
<class 'torch.nn.modules.activation.ReLU'>
<class 'torch.nn.modules.pooling.MaxPool2d'>
<class 'torch.nn.modules.container.Sequential'>
<class 'torch.nn.modules.container.Sequential'>
<class 'torch.nn.modules.container.Sequential'>
<class 'torch.nn.modules.container.Sequential'>
                --- cut off ---
<class 'torch.nn.modules.pooling.AvgPool2d'>
<class 'torch.nn.modules.linear.Linear'>

And then we are adding a bunch of other layers (almost double).

<class 'torch.nn.modules.conv.Conv2d'>
<class 'torch.nn.modules.batchnorm.BatchNorm2d'>
<class 'torch.nn.modules.activation.ReLU'>
<class 'torch.nn.modules.pooling.MaxPool2d'>
<class 'torch.nn.modules.container.Sequential'>
<class 'torch.nn.modules.container.Sequential'>
<class 'torch.nn.modules.container.Sequential'>
<class 'torch.nn.modules.container.Sequential'>
                --- add ---
<class 'fastai.layers.AdaptiveConcatPool2d'>
<class 'fastai.layers.Flatten'>
<class 'torch.nn.modules.batchnorm.BatchNorm1d'>
<class 'torch.nn.modules.dropout.Dropout'>
<class 'torch.nn.modules.linear.Linear'>
<class 'torch.nn.modules.activation.ReLU'>
<class 'torch.nn.modules.batchnorm.BatchNorm1d'>
<class 'torch.nn.modules.dropout.Dropout'>
<class 'torch.nn.modules.linear.Linear'>
<class 'torch.nn.modules.activation.LogSoftmax'>

I wanted to get some insight into why these are laid out the way they are, since I believe they are important from a transfer learning point of view. Is this something you will take up this coming Monday? (In the meantime I am digging further.)

jeremy · November 4, 2017, 5:34pm

We’ll be gradually covering it. But basically we’re adding two fully connected blocks, each containing batchnorm, dropout, and relu activation.
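The layout of the added layers in the listing above can be reproduced with a small spec builder. This is a descriptive sketch only: it emits layer descriptions as strings rather than real modules, and the sizes (1024 input features, 512 hidden units), the dropout probabilities, and the names `fc_block` and `build_head` are assumptions, not values confirmed in this thread.

```python
def fc_block(n_in, n_out, p, activation):
    """One fully connected block: batchnorm, dropout, linear, activation."""
    return [f"BatchNorm1d({n_in})",
            f"Dropout(p={p})",
            f"Linear({n_in}, {n_out})",
            activation]


def build_head(n_features, hidden, n_classes, ps=(0.25, 0.5)):
    """Two blocks; the second ends in LogSoftmax, as in the listing above."""
    return (fc_block(n_features, hidden, ps[0], "ReLU")
            + fc_block(hidden, n_classes, ps[1], "LogSoftmax"))


head_spec = build_head(n_features=1024, hidden=512, n_classes=2)
for layer in head_spec:
    print(layer)

# Layer types only, for comparison with the class listing in the thread.
layer_types = [s.split("(")[0] for s in head_spec]
```

The resulting order of layer types is BatchNorm1d, Dropout, Linear, ReLU, BatchNorm1d, Dropout, Linear, LogSoftmax, matching the eight classes printed after AdaptiveConcatPool2d and Flatten.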

runze · November 6, 2017, 2:32am

I’m still a bit confused about this precompute argument, especially this statement in the lesson 1 notebook:

Review: easy steps to train a world-class image classifier
…
Train last layer with data augmentation (i.e. precompute=False) for 2-3 epochs with cycle_len=1
…

This sentence makes me think setting precompute=False and data augmentation go hand in hand - is that the case? If so, in the tutorial code when we do data augmentation like below (with precompute=True), are we actually augmenting anything?

tfms = tfms_from_model(resnet34, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(arch, data, precompute=True)

ar_ai · November 6, 2017, 2:50am

By default when we create a learner, it sets all but the last layer to frozen. That means that it’s still updating the weights in the last layer when we call fit. When we set precompute = False, it unfreezes all layers.

runze · November 6, 2017, 3:09am

Then what’s the difference between precompute=False and learn.unfreeze()?

ar_ai · November 6, 2017, 3:37am

Precompute = True/False indicates whether we want to use precomputed activations.
learn.freeze/learn.unfreeze indicates whether the parameters in the network are trainable or not during training of the network.
Irrespective of whether precompute is True or False, weights in the network will change during training if the layers are unfrozen. The ‘precompute’ argument tells the network what to do before training, and ‘freeze/unfreeze’ tells it what to do during training.

runze · November 6, 2017, 4:25am

Thanks, I think I understand it now. Just to confirm,

1) If precompute=True, it doesn’t matter whether we call learn.freeze() or learn.unfreeze(), does it? Since all the activations before the last layer are precomputed? (So it doesn’t really make sense to call learn.unfreeze() with precompute=True, does it?)
2) With learn.freeze(), setting precompute=False likely leads to similar (if not the same) results as precompute=True, since the weights stay the same. Is that right?

ar_ai · November 6, 2017, 4:57am

1) Precompute = True precomputes the activations for all but the last layer. But those activations can change during training, depending on whether the layers are frozen or unfrozen. In the case of precompute=True with learn.unfreeze(), the initial weights in the network are from precomputed activations. During training, those weights change in all the layers, since you have now kept the weights in every layer as trainable parameters.
2) Precompute = False with learn.freeze() means only the last layer can learn now, and all the other layers have a random initialization of weights which are not optimised for any dataset. Hence, they are far away from local minima most of the time, and freezing every layer except the last severely limits the learning ability of the network.

abi · November 6, 2017, 7:24am

I don’t think I follow @ar_ai’s explanation of the interactions between precompute + data augmentation + freeze/unfreeze. The concepts are still a bit confusing to me.

My understanding and intuition were similar to @runze’s original understanding that precompute=False and data augmentation go hand in hand. During data augmentation we slightly change the images in the mini-batches we generate, so we cannot lock the pre-computed activations. Hence, while doing data augmentation we have to say precompute=False. I’m not really sure what happens if you say precompute=True and do data augmentation.

Moreover, I think:

pre-compute refers to activations before the last layer.
freeze/un-freeze refers to weights of the layers before the last layer

Hopefully, Jeremy is going to go over these details in the class tomorrow.

jeremy · November 6, 2017, 10:06am

This is incorrect. Only unfreeze or freeze_to do that.

If we’re precomputing, we can’t use data augmentation (since the precomputed activations are for some specific input, whereas augmentation changes it every time).
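A toy illustration of why the two do not combine: the cached activation belongs to the original input, so the head never sees the effect of an augmented one. Here `frozen_body` and `flip` are hypothetical stand-ins (a position-sensitive weighted sum and a horizontal flip), not fastai code.

```python
def frozen_body(x):
    """Stand-in for the frozen layers; position-sensitive like a conv net."""
    weights = [1.0, 2.0, 3.0]
    return sum(w * v for w, v in zip(weights, x))


def flip(x):
    """Toy augmentation: horizontal flip of a 1D 'image'."""
    return list(reversed(x))


image = [0.2, 0.5, 0.9]

# precompute=True: the activation is computed once, for the original image.
cached_activation = frozen_body(image)

# Augmentation produces a different input on a later pass...
augmented = flip(image)
live_activation = frozen_body(augmented)

# ...but with precompute the head is still fed the cached value.
head_input_with_precompute = cached_activation
head_input_without_precompute = live_activation

print(head_input_with_precompute != head_input_without_precompute)  # True
```

With precompute on, the augmentation has no effect on what the last layer is trained on; with precompute off, every augmented image passes through the body and yields fresh activations.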