Feeding different-sized images to fine-tune ResNet34

I am asking this question with reference to ResNet34, since that was the model Jeremy used in Lecture 3 to do transfer learning.

We feed different-sized images to the network and it just works fine. But how? A different image size is not a problem for the convolutional operations, but when we flatten at the last stages, the fully connected layer will have a different number of nodes and a different number of weights corresponding to them. That might not be a problem if we were training the network from scratch, but how do we fine-tune if the number of parameters is different?

Similarly, in that lecture someone asked what happens if our pre-trained model was trained on images with 3 channels but we want to feed images with 4 channels. The answer was to add one more convolution operator to the stack to process the extra channel, but I can’t visualize how that plays out when we have a pre-trained model. Will all the parameters corresponding to the 3 channels the model already has start from the pre-trained weights, while the extra channel we added starts at random?

Also, there will probably be a similar issue when we come across the residual blocks, since we have to make adjustments so that the sizes match at the start and the end of each residual block, and different image sizes would need different adjustments, etc.

Hi Ikadorus,

You are right that convolutional layers are size-agnostic. Afterwards, their spatial features are reduced by pooling layers. If you look at the ResNet structure and study the PyTorch docs on pooling layers, you will understand how the model can have the same number of weights yet process different-sized images.

To add a channel to a pretrained model, my favorite way is:

  • create the pretrained model
  • replace the first layer with a new Conv2d that has one additional input channel and the same number of output channels
  • initialize the weights of the new Conv2d with the weights of the original Conv2d for the first three input channels, and leave the fourth input channel with its random weights (see the sketch below)
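
A minimal sketch of those three steps in plain PyTorch/torchvision (the weights="DEFAULT" argument is assumed; older torchvision versions take pretrained=True instead):

    import torch
    import torch.nn as nn
    from torchvision import models

    # 1. create the pretrained model
    model = models.resnet34(weights="DEFAULT")

    # 2. a new Conv2d with 4 input channels and the same output channels,
    #    kernel size, stride, and padding as the original first layer
    old_conv = model.conv1
    new_conv = nn.Conv2d(4, old_conv.out_channels,
                         kernel_size=old_conv.kernel_size,
                         stride=old_conv.stride,
                         padding=old_conv.padding,
                         bias=False)

    # 3. copy the pretrained weights into the first three input channels;
    #    the fourth channel keeps its random initialization
    with torch.no_grad():
        new_conv.weight[:, :3] = old_conv.weight

    model.conv1 = new_conv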

There are other ways to accomplish the same thing, for example, taking linear combinations of the four channels to get three. You may have to experiment to find out which way works best for your task.
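
For the linear-combination variant, one possible sketch (this assumes a learnable 1x1 convolution placed in front of the unmodified pretrained model, which is just one of several reasonable choices):

    import torch.nn as nn
    from torchvision import models

    # learn a linear combination of the 4 input channels down to 3,
    # then feed the result to the unchanged pretrained resnet34
    reduce = nn.Conv2d(4, 3, kernel_size=1, bias=False)
    model = nn.Sequential(reduce, models.resnet34(weights="DEFAULT"))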

HTH,
Malcolm


“initialize the weights of the new Conv2d with the weights of the original Conv2d for the first three input channels, and leave the fourth input channel with its random weights.”

Wow, that is exactly what I was thinking, thank you very much for answering. But in this scenario I assume we should unfreeze and train the whole network, since the extra channel at the first layer starts at random and training just the last layers won’t update its weights?

Also, in the case of the residual blocks, because the amount of padding we have to add changes with the image size, we should also unfreeze and train the whole network. Am I right?

Also, what happens to the fully connected layer? It seems like we can’t use the pre-trained weights since the number of nodes is different, so should we delete and recreate that fully connected layer and start its weights and biases at random?

I don’t know. Perhaps this is an empirical question. You could even initialize the new channel weights with the mean of the pretrained channel weights, and use the standard procedure.
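
Continuing the sketch from my earlier reply (new_conv and old_conv are the layers defined there), that mean initialization would be something like:

    with torch.no_grad():
        # fourth input channel starts as the mean of the three pretrained channels
        new_conv.weight[:, 3] = old_conv.weight.mean(dim=1)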

“Also, in the case of the residual blocks, because the amount of padding we have to add changes with the image size, we should also unfreeze and train the whole network. Am I right?”

The main point is that after replacing the first convolutional layer, everything else in the model structure remains exactly the same. You do not have to adjust anything.

“Also, what happens to the fully connected layer? It seems like we can’t use the pre-trained weights since the number of nodes is different, so should we delete and recreate that fully connected layer and start its weights and biases at random?”

The fully connected layer(s) are in the head and are already randomly initialized. They are not pre-trained.

Suggestion: try it and learn. Practical experience will answer many theoretical questions.

“The main point is that after replacing the first convolutional layer, everything else in the model structure remains exactly the same. You do not have to adjust anything.”

I think this is only true for the channels. But if we feed different-sized images, I think we should adjust the residual blocks.

If you try implementing a four-channel ResNet34, you will discover whether the problem you anticipate is real or not. :slightly_smiling_face:

It took me a while to understand how a pre-trained model could use a different image size than the one it was trained on. I wrote up my notes on using resnet18 for MNIST classification. Maybe they will be helpful to someone.

How does the pre-trained model use our image size? The PyTorch resnet18 model was trained on ImageNet images. These images are 3-channel 224x224 pixels. Our images are 3-channel 28x28. How can we use pre-trained weights from a model trained on a different image size? Let’s look at the first layers of the model instantiated as learn:

  Path: /kaggle/working/data, model=Sequential(
  (0): Sequential(
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (4): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      :
      :
      :

The first layer is a 2d convolution. What is a convolution doing? The convolution slides a 7x7 kernel across each input channel of your image and does an element-by-element multiplication, summed, against what it sees. (This is sometimes loosely described as a matrix multiplication, which is not quite right, or as a dot product; the dot product of the stretched-out kernel (7x7 = 49 elements here) with the image patch is the same thing as the element-by-element multiplication summed.) In this model there are 64 such kernels, each spanning all three input channels. A key idea is that the elements of these 7x7 kernels are learned weights, and the sizes of the kernels are fixed even though the image sizes are not. This means the count of weights does not vary at all as a function of image size. It also makes clear why we needed to convert our one-channel images to three channels: the kernels are the weights, and pre-trained resnet18 requires exactly 3 input channels. In addition to the convolutional layers, the model has ReLU layers too, but there are no weights there; a ReLU just applies a non-linearity to what the prior convolutional layer outputs. So how many weights are there in the first convolutional layer? 7 x 7 x 3 channels x 64 kernels = 9408.

At some point, though, this logic breaks. At the end of the model we take the last convolutional layer, connect it to a “fully-connected” layer, and output a certain number of classes. There are two things to understand about this. First let’s look at the actual pre-trained model from the PyTorch website (not our learn instance):

    :
    :
    :
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=1000, bias=True)
)

The last convolutional block feeds into an AdaptiveAvgPool2d(...) layer, which feeds into a fully-connected layer with 1000 outputs. These 1000 outputs are the ImageNet classes. Without going into details, the key idea of the AdaptiveAvgPool2d layer is that it takes any input size and always produces the same output size. This is how an input image of any size can pass through the convolutional layers and still fit a pre-defined fully-connected layer. However, this last layer is where our idea breaks down: we don’t have 1000 classes, we have 10. To see how fastai deals with this, let’s look not at the model from the PyTorch website, but at the actual model instantiated as learn. When we look at the last layers of this model we see:

    : 
    : 
    : 
     (1): BasicBlock(
        (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
  )
  (1): Sequential(
    (0): AdaptiveConcatPool2d(
      (ap): AdaptiveAvgPool2d(output_size=1)
      (mp): AdaptiveMaxPool2d(output_size=1)
    )
    (1): Flatten()
    (2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.25, inplace=False)
    (4): Linear(in_features=1024, out_features=512, bias=True)
    (5): ReLU(inplace=True)
    (6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): Dropout(p=0.5, inplace=False)
    (8): Linear(in_features=512, out_features=10, bias=True)
  )

Complex! fastai cut off the avgpool and the fc layers from the pre-trained model and then added a bunch of its own layers. These added layers are the only ones trained when you first call fit_one_cycle. Notice that out_features=10: fastai made this layer for us based on the number of classes in our data. And that’s it.
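
A small sketch in plain PyTorch/torchvision that checks the points above, namely the fixed weight count of the first convolution, the size-agnostic convolutions, and the adaptive pooling that makes any image size fit the head (the weights="DEFAULT" argument and the exact spatial sizes in the comments are assumptions based on resnet18’s strides):

    import torch
    import torch.nn as nn
    from torchvision import models

    # eval mode so BatchNorm uses running stats (needed for very small feature maps)
    m = models.resnet18(weights="DEFAULT").eval()

    # The first convolution has a fixed number of weights, whatever the image size:
    print(m.conv1.weight.shape)    # torch.Size([64, 3, 7, 7])
    print(m.conv1.weight.numel())  # 9408 = 7 * 7 * 3 * 64

    # The same convolution accepts any spatial size:
    print(m.conv1(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 64, 112, 112])
    print(m.conv1(torch.randn(1, 3, 28, 28)).shape)    # torch.Size([1, 64, 14, 14])

    # Adaptive pooling always produces the same output size, so the body's output
    # always matches the fully-connected head:
    body = nn.Sequential(*list(m.children())[:-2])     # everything before avgpool and fc
    pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
    print(pool(body(torch.randn(1, 3, 224, 224))).shape)  # torch.Size([1, 512, 1, 1])
    print(pool(body(torch.randn(1, 3, 28, 28))).shape)    # torch.Size([1, 512, 1, 1])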


Thanks for this reply, it will be very helpful for lots of people.

But what puzzled me about feeding different-sized (or different-channel) images into the network is: what happens at the residual blocks?

@Ikadorus, the famous “ResNet block” above is the BasicBlock. We can simply read off the names of the components (Conv2d, BatchNorm2d, ReLU) and their order, and see that they match the architecture diagram:

[ResNet BasicBlock architecture diagram]

Notice that the stride is 1, the kernel size is 3x3, and the (zero) padding is 1. As discussed in this great post, if you have a stride of 1 and set the padding equal to (K-1)/2, then the output size will be the same as the input size. So, again, we are robust to whatever image size you want (image size, not number of channels; we must have 3 and only 3 channels): the network just continues to pass your size through. But where is the addition step? If you look at the PyTorch source code for BasicBlock() you will see it! It’s very simple: the forward pass through the block is

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out

At the start of the BasicBlock we store the input as identity, and at the end we add it back with out += identity.
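
If you want to check that a stride-1 BasicBlock really does preserve whatever spatial size it receives (so the skip connection adds tensors of matching shape), a quick shape test along these lines should do it (torchvision’s BasicBlock is assumed; the block here is freshly initialized, not pretrained):

    import torch
    from torchvision.models.resnet import BasicBlock

    block = BasicBlock(inplanes=64, planes=64)       # stride 1, 3x3 convs, padding 1, no downsample
    print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
    print(block(torch.randn(1, 64, 7, 7)).shape)     # torch.Size([1, 64, 7, 7])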


That is true if we use “same” padding, but if our pretrained model used “valid” padding in its residual blocks, can we say that we cannot feed different-sized images to our model?