precompute=True

Cool.

So it looks like it is grabbing everything except the average pooling and the final FC layer:

from torchvision.models import resnet34  # re-exported by fastai's wildcard imports in the notebook

r = resnet34()
list(r.children())[:8]

output:

[Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False),
 BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True),
 ReLU (inplace),
 MaxPool2d (size=(3, 3), stride=(2, 2), padding=(1, 1), dilation=(1, 1)),
 Sequential (
   (0): BasicBlock (
     (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
   )
   (1): BasicBlock (
     (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
   )
   (2): BasicBlock (
     (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
   )
 ),
 Sequential (
   (0): BasicBlock (
     (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
     (downsample): Sequential (
       (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
       (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
     )
   )
   (1): BasicBlock (
     (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
   )
   (2): BasicBlock (
     (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
   )
   (3): BasicBlock (
     (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
   )
 ),
 Sequential (
   (0): BasicBlock (
     (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     (downsample): Sequential (
       (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
       (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     )
   )
   (1): BasicBlock (
     (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
   )
   (2): BasicBlock (
     (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
   )
   (3): BasicBlock (
     (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
   )
   (4): BasicBlock (
     (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
   )
   (5): BasicBlock (
     (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
   )
 ),
 Sequential (
   (0): BasicBlock (
     (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
     (downsample): Sequential (
       (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
       (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
     )
   )
   (1): BasicBlock (
     (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
   )
   (2): BasicBlock (
     (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
     (relu): ReLU (inplace)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
   )
 )]

It then looks like it is appending an AdaptiveConcatPool2d() layer plus a new FC layer (output size = number of classes).

Exactly. AdaptiveConcatPool2d and Flatten:

layers += [AdaptiveConcatPool2d(), Flatten()]

We can control the depth of the cut with the xtra_cut parameter:

cut -= xtra_cut
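
For reference, a rough sketch of how that cut ends up being applied (the value 8 and the variable names below are assumptions based on fastai 0.7 conventions, not the library source):

from torchvision.models import resnet34

# Illustrative values mirroring fastai 0.7's model_meta: for resnet34 the cut is 8,
# i.e. keep the first 8 children (everything before AvgPool2d and Linear).
cut = 8
xtra_cut = 0                 # drop this many extra layers from the end of the backbone
cut -= xtra_cut

backbone_layers = list(resnet34().children())[:cut]
print(len(backbone_layers))  # 8 when xtra_cut=0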

Lost here. What is it precomputing?

I understand using pre-trained weights and freezing layers while training, but I'm not able to understand precomputing activations.


You just need to take your thought to the next step: since you are using pre-trained weights and freezing those layers, the output from that set of layers won't change in each epoch as you go through the inputs. So you can precompute the output from those layers once. This makes training the later layers, whose weights are changing, MUCH faster.
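
Here is a minimal plain-PyTorch sketch of that idea (not the fastai implementation; the dummy data, the simple avg-pool head, and the two-class setup are assumptions for illustration): run the frozen backbone over the dataset once, cache the outputs, then train only the head against those cached features.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet34

# Frozen backbone: the first 8 children of resnet34 (everything before pooling/FC).
backbone = nn.Sequential(*list(resnet34().children())[:8]).eval()
for p in backbone.parameters():
    p.requires_grad = False

# Small trainable head (plain avg-pool here; fastai uses AdaptiveConcatPool2d).
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 2))

# Dummy data standing in for a real, *unshuffled* DataLoader.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
loader = DataLoader(TensorDataset(images, labels), batch_size=4, shuffle=False)

# 1) Precompute: one pass through the frozen layers, done exactly once.
with torch.no_grad():
    cached = [(backbone(xb), yb) for xb, yb in loader]

# 2) Train: each epoch only runs the cheap head over the cached activations.
opt = torch.optim.SGD(head.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(3):
    for feats, yb in cached:
        loss = loss_fn(head(feats), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()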


Aah, got it! Thanks! :)

In the next few lectures Jeremy will explain the concept of differential learning rates. The basic idea is that you use different learning rates for different layers: layers closer to the input get lower learning rates, and layers closer to the output get higher ones. model_meta is used to define three groups of layers for the purpose of applying differential learning rates.
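
Underneath, differential learning rates are just per-parameter-group learning rates in the optimizer. A hedged plain-PyTorch sketch (the three-way split and the specific rates are illustrative assumptions, not fastai's actual grouping):

import torch
import torch.nn as nn
from torchvision.models import resnet34

model = resnet34()
children = list(model.children())
# Assumed split into three layer groups: early conv layers, later conv layers, and the head.
early, middle, head = children[:6], children[6:8], children[8:]

def params(modules):
    return [p for m in modules for p in m.parameters()]

opt = torch.optim.SGD(
    [
        {"params": params(early),  "lr": 1e-4},  # closest to the input: smallest lr
        {"params": params(middle), "lr": 1e-3},
        {"params": params(head),   "lr": 1e-2},  # closest to the output: largest lr
    ],
    lr=1e-2,       # default, overridden per group above
    momentum=0.9,
)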


Hello @jeremy

I see that we are chopping off the last two layers of resnet34:

<class 'torch.nn.modules.conv.Conv2d'>
<class 'torch.nn.modules.batchnorm.BatchNorm2d'>
<class 'torch.nn.modules.activation.ReLU'>
<class 'torch.nn.modules.pooling.MaxPool2d'>
<class 'torch.nn.modules.container.Sequential'>
<class 'torch.nn.modules.container.Sequential'>
<class 'torch.nn.modules.container.Sequential'>
<class 'torch.nn.modules.container.Sequential'>
                --- cut off ---
<class 'torch.nn.modules.pooling.AvgPool2d'>
<class 'torch.nn.modules.linear.Linear'>

And then we are adding a bunch of other layers (almost double).

<class 'torch.nn.modules.conv.Conv2d'>
<class 'torch.nn.modules.batchnorm.BatchNorm2d'>
<class 'torch.nn.modules.activation.ReLU'>
<class 'torch.nn.modules.pooling.MaxPool2d'>
<class 'torch.nn.modules.container.Sequential'>
<class 'torch.nn.modules.container.Sequential'>
<class 'torch.nn.modules.container.Sequential'>
<class 'torch.nn.modules.container.Sequential'>
                --- add ---
<class 'fastai.layers.AdaptiveConcatPool2d'>
<class 'fastai.layers.Flatten'>
<class 'torch.nn.modules.batchnorm.BatchNorm1d'>
<class 'torch.nn.modules.dropout.Dropout'>
<class 'torch.nn.modules.linear.Linear'>
<class 'torch.nn.modules.activation.ReLU'>
<class 'torch.nn.modules.batchnorm.BatchNorm1d'>
<class 'torch.nn.modules.dropout.Dropout'>
<class 'torch.nn.modules.linear.Linear'>
<class 'torch.nn.modules.activation.LogSoftmax'>

I wanted to get some insight into why these are laid out the way they are; I believe they are important from a transfer learning point of view. Is this something you will take up this coming Monday? (In the meantime I am digging further.)


We’ll be gradually covering it. But basically we’re adding two fully connected blocks, each containing batchnorm, dropout, and a ReLU activation.
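
A sketch of what that head looks like when rebuilt in plain PyTorch (AdaptiveConcatPool2d concatenates adaptive max and average pooling; the layer sizes, dropout rates, and num_classes below are illustrative assumptions rather than the library defaults):

import torch
import torch.nn as nn

class AdaptiveConcatPool2d(nn.Module):
    """Concatenate adaptive max pooling and adaptive average pooling."""
    def __init__(self, output_size=1):
        super().__init__()
        self.ap = nn.AdaptiveAvgPool2d(output_size)
        self.mp = nn.AdaptiveMaxPool2d(output_size)

    def forward(self, x):
        return torch.cat([self.mp(x), self.ap(x)], dim=1)

num_classes = 2  # assumption: e.g. dogs vs cats

head = nn.Sequential(
    AdaptiveConcatPool2d(), nn.Flatten(),        # 512 channels -> 1024 features (max + avg)
    nn.BatchNorm1d(1024), nn.Dropout(0.25), nn.Linear(1024, 512), nn.ReLU(),
    nn.BatchNorm1d(512),  nn.Dropout(0.5),  nn.Linear(512, num_classes),
    nn.LogSoftmax(dim=1),
)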


I’m still a bit confused about this precompute argument, especially this statement in the lesson 1 notebook:

Review: easy steps to train a world-class image classifier

Train last layer with data augmentation (i.e. precompute=False) for 2-3 epochs with cycle_len=1

This sentence makes me think setting precompute=False and data augmentation go hand in hand - is that the case? If so, in the tutorial code when we do data augmentation like below (with precompute=True), are we actually augmenting anything?

tfms = tfms_from_model(resnet34, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(arch, data, precompute=True)

By default when we create a learner, it sets all but the last layer to frozen. That means that it’s still updating the weights in the last layer when we call fit. When we set precompute = False, it unfreezes all layers.


Then what’s the difference between precompute=False and learn.unfreeze()?


precompute=True/False indicates whether we want to use precomputed activations.
learn.freeze()/learn.unfreeze() indicates whether the parameters in the network are trainable during training.
Irrespective of whether precompute is True or False, the weights in the network will change during training if the layers are unfrozen. The precompute argument tells the network what to do before training, and the freeze/unfreeze methods tell it what to do during training.
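
Putting the two together, the lesson 1 workflow looks roughly like this, continuing from the snippet quoted above (hedged: this follows my recollection of the fastai 0.7 API, so exact argument names may differ):

import numpy as np

learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(1e-2, 2)                  # head trained against activations cached before training

learn.precompute = False            # activations computed on the fly, so augmentation now takes effect
learn.fit(1e-2, 3, cycle_len=1)     # layers still frozen: only the head is trainable

learn.unfreeze()                    # from here on, every layer group is trainable
lrs = np.array([1e-4, 1e-3, 1e-2])  # differential learning rates per layer group
learn.fit(lrs, 3, cycle_len=1)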


Thanks, I think I understand it now. Just to confirm,

  • If precompute=True, it doesn’t matter whether we call learn.freeze() or learn.unfreeze(), does it? Since all the activations before the last layer are pre-computed? (So it doesn’t really make sense to call learn.unfreeze() with precompute=True, does it?)
  • If learn.freeze(), setting precompute=False likely leads to similar (if not the same) results as precompute=True, since the weights stay the same, is that right?
  1. precompute=True precomputes the activations for all but the last layers. But those activations can change during training depending on whether the layers are frozen or unfrozen. In the case of precompute=True with learn.unfreeze(), the initial weights in the network come from the precomputed activations. During training, those weights change in all the layers, as you have now kept the weights in every layer as trainable parameters.
  2. precompute=False with learn.freeze() means only the last layer can learn now, and all the other layers have a random initialization of weights which are not optimised for any dataset. Hence, they are far from a local minimum most of the time, and freezing every layer except the last severely limits the learning ability of the network.

I don’t think I follow @ar_ai’s explanation of the interactions between precompute, data augmentation, and freeze/unfreeze. The concepts are still a bit confusing to me.

My understanding and intuition were also similar to @runze’s original understanding that precompute=False and data augmentation go hand in hand. During data augmentation we are slightly changing the images in the mini-batches we generate, so we cannot lock in the pre-computed activations. Hence, while doing data augmentation we have to set precompute=False. I’m not really sure what happens if you set precompute=True and do data augmentation.

Moreover, I think:

  • pre-compute refers to the activations before the last layer.
  • freeze/un-freeze refers to the weights of the layers before the last layer.

Hopefully, Jeremy is going to go over these details in the class tomorrow.

This (the earlier claim that setting precompute=False unfreezes all the layers) is incorrect. Only unfreeze or freeze_to do that.

If we’re precomputing, we can’t use data augmentation (since the precomputed activations are for some specific input, whereas augmentation changes it every time).


OK… looks like I got it totally wrong. Let me try again.
precompute=True/False: whether we feed precomputed activations into the network for everything except the last layer.
freeze/unfreeze: whether the weights change or not during training. Is this correct?

Yes, that’s reasonably correct, although some details are missing… e.g. freeze/unfreeze refer to all but the last layer, but there’s also freeze_to(idx), which freezes layers up to (but not including) layer number idx.
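
Under the hood, freezing just means turning off gradients for the earlier layer groups. A hedged sketch of what freeze_to(idx) amounts to (the grouping and helper below are assumptions, not the fastai source):

import torch.nn as nn
from torchvision.models import resnet34

children = list(resnet34().children())
# Assumed grouping: two backbone groups plus a head group.
layer_groups = [nn.Sequential(*children[:6]),
                nn.Sequential(*children[6:8]),
                nn.Sequential(*children[8:])]

def freeze_to(idx):
    """Freeze layer groups [0, idx); groups from idx onwards stay trainable."""
    for i, group in enumerate(layer_groups):
        trainable = i >= idx
        for p in group.parameters():
            p.requires_grad = trainable

freeze_to(2)   # only the last group learns (roughly what the default freeze does)
freeze_to(0)   # unfreeze everything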


Yes, I noticed that.

Hi @jeremy, you will be covering these things in detail in the upcoming sessions, right? I am not able to follow the discussion in this thread because I have no prior knowledge.