It took me a while to understand how a pre-trained model can be used with a different image size than the one it was trained on. I wrote up my notes on using resnet18 for MNIST classification. Maybe they will be helpful to someone.
How does the pre-trained model use our image size? The PyTorch resnet18
model was trained on ImageNet images, which are 3-channel, 224x224 pixels. Our images are 3-channel, 28x28. How can we use pre-trained weights from a model trained on a different image size? Let's look at the first layers of the model instantiated as learn:
Path: /kaggle/working/data, model=Sequential(
(0): Sequential(
(0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(4): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
:
:
:
The first layer is a 2d convolution. What is a convolution doing? It slides a 7x7 kernel across the input image and, at each position, multiplies the kernel element-by-element with the patch it is sitting on and sums the result. (This is sometimes described as a matrix multiplication, which is wrong, or as a dot product, which is right: the dot product of the stretched-out kernel (7x7 = 49 values here) with the stretched-out image patch is exactly the element-by-element multiplication summed.) In this model there are 64 such kernels, and each one spans all of the input channels, so each is a 3x7x7 block of numbers. The key idea is that these numbers are the learned weights. The kernel sizes are fixed even though the image size is not, so the number of weights does not depend on the image size at all. This also makes clear why we needed to convert our one-channel images to three channels: pre-trained resnet18 requires 3 input channels. So how many weights are there in the first convolutional layer? 7 x 7 x 3 channels x 64 kernels = 9408.
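To make this concrete, here is a minimal sketch using torchvision's resnet18 directly (not the notebook's learn object, though learn wraps the same architecture; this assumes torch and torchvision are installed). It checks the first layer's weight shape and count, verifies the dot-product claim, and shows that the very same weights apply to a 224x224 input and a 28x28 input:

import torch
from torchvision import models

model = models.resnet18()  # pre-trained weights would be loaded the same way

# The first convolutional layer: 64 kernels, each 3 channels x 7 x 7.
print(model.conv1.weight.shape)    # torch.Size([64, 3, 7, 7])
print(model.conv1.weight.numel())  # 9408 weights, no matter what size the images are

# Applying one kernel at one position is element-by-element multiplication summed,
# which is the same as a dot product of the two stretched-out vectors.
kernel = model.conv1.weight[0]  # shape (3, 7, 7)
patch = torch.randn(3, 7, 7)    # a random image patch of the same size
print(torch.allclose((kernel * patch).sum(), kernel.flatten() @ patch.flatten()))  # True

# The same layer accepts different image sizes; only the output size changes.
print(model.conv1(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 64, 112, 112])
print(model.conv1(torch.randn(1, 3, 28, 28)).shape)    # torch.Size([1, 64, 14, 14])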
In addition to convolutional layers, the model has ReLU layers too. There are no weights here; a ReLU just applies a non-linearity to whatever the preceding convolutional layer outputs. At some point, though, this logic breaks down. At the end of the model, the output of the last convolutional layer is connected to a "fully-connected" layer whose outputs correspond to a fixed number of classes. There are two things to understand about this. First, let's look at the actual pre-trained model from the PyTorch website (not our learn instance):
:
:
:
(1): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
(fc): Linear(in_features=512, out_features=1000, bias=True)
)
The last convolutional block feeds into an AdaptiveAvgPool2d layer, which feeds into a fully-connected layer with 1000 outputs. These 1000 outputs are the ImageNet classes. Without going into details, the key idea of the AdaptiveAvgPool2d layer is that it takes any input size and always produces the same output size. This is how an image of any size can pass through the convolutional layers and still fit the pre-defined fully-connected layer.
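A quick sketch makes the point (again using torchvision directly rather than the notebook's code): AdaptiveAvgPool2d squeezes each channel down to a fixed spatial size no matter how large the incoming feature map is, so the stock resnet18 produces 1000 scores for either image size:

import torch
from torch import nn
from torchvision import models

pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
print(pool(torch.randn(1, 512, 7, 7)).shape)  # torch.Size([1, 512, 1, 1]); 7x7 is roughly what a 224x224 image leaves here
print(pool(torch.randn(1, 512, 2, 2)).shape)  # torch.Size([1, 512, 1, 1]); a smaller image leaves a smaller map, same output

# So the stock resnet18 runs on either image size and always yields 1000 scores.
model = models.resnet18()
model.eval()  # avoid batch-norm complaints about single-image batches
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
print(model(torch.randn(1, 3, 28, 28)).shape)    # torch.Size([1, 1000])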
However, this last layer is where our idea breaks down: we don't have 1000 classes, we have 10. To see how fastai deals with this, let's look not at the model from the PyTorch website, but at the actual model instantiated as learn. When we look at the last layers of this model we see:
:
:
:
(1): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(1): Sequential(
(0): AdaptiveConcatPool2d(
(ap): AdaptiveAvgPool2d(output_size=1)
(mp): AdaptiveMaxPool2d(output_size=1)
)
(1): Flatten()
(2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Dropout(p=0.25, inplace=False)
(4): Linear(in_features=1024, out_features=512, bias=True)
(5): ReLU(inplace=True)
(6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Dropout(p=0.5, inplace=False)
(8): Linear(in_features=512, out_features=10, bias=True)
)
Complex! fastai cut off the avgpool and fc layers from the pre-trained model and then added a bunch of its own layers. (AdaptiveConcatPool2d concatenates an adaptive average pool with an adaptive max pool, which is why the first Linear layer has in_features=1024, i.e. 2 x 512.) These added layers are the only ones trained when you first call fit_one_cycle. Notice that out_features=10; fastai made this layer for us based on the number of classes in our data. And that's it.
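For intuition, here is a rough sketch of that cut-and-replace idea. It is not the actual fastai code and it uses a plain AdaptiveAvgPool2d head rather than fastai's richer one, but the mechanics are the same: keep the pre-trained body, freeze it, and attach a new head whose last layer has as many outputs as you have classes.

import torch
from torch import nn
from torchvision import models

pretrained = models.resnet18()  # in practice you would load the pre-trained ImageNet weights

# Keep everything up to the last convolutional block; drop avgpool and fc.
body = nn.Sequential(*list(pretrained.children())[:-2])
for p in body.parameters():
    p.requires_grad = False  # freeze the body: only the new head gets trained at first

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # any feature-map size -> 512 x 1 x 1
    nn.Flatten(),
    nn.Linear(512, 10),       # 10 classes instead of 1000
)
model = nn.Sequential(body, head)

x = torch.randn(2, 3, 28, 28)  # a small batch of 3-channel 28x28 images
print(model(x).shape)          # torch.Size([2, 10])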