I’ve been training a model using resnet18 as a base on the MNIST dataset. I was wondering why this seems to work without problems, although in theory MNIST images have only one channel and are 28x28, while the resnet input is expected to be 3 channels and 224x224.
I’ve been checking the fastai code but I don’t quite understand why this works without throwing errors. Is the input somehow “scaled” to go from 28x28x1 to 224x224x3? If so, where in the fastai code does it happen? Or have I completely misunderstood how this works?
I think the image size and the number of channels are solved in two different ways.
For the image size, we use something called AdaptivePooling at the end of the convolutional layers.
I think the best way of understanding AdaptivePooling is with an example.
Let’s say you have a 10x10 image.
What happens if you use a 3x3 filter in this image?
The resulting size is given by the formula:
new_W=((W−F+2P)/S) + 1
where W is the image width (or height), F is the size of the kernel, P is the amount of padding, and S is the stride.
So for our case:
new_W = (10 - 3 + 0) / 1 + 1
new_W = 8
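The same arithmetic can be checked directly in PyTorch (assuming PyTorch here, since that is what fastai builds on): a 3x3 convolution with stride 1 and no padding over a 10x10 input produces an 8x8 output.

```python
import torch
import torch.nn as nn

# A 10x10 single-channel "image" as a batch of one: (N, C, H, W)
x = torch.randn(1, 1, 10, 10)

# 3x3 kernel, stride 1, no padding: new_W = (10 - 3 + 0) / 1 + 1 = 8
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=0)
print(conv(x).shape)  # torch.Size([1, 1, 8, 8])
```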
So what would the kernel size need to be if you wanted a 4x4 image as the result? You just need to isolate F and solve for it. That is exactly what adaptive pooling does.
So when you use adaptive pooling, instead of specifying the kernel size you specify the output size, and it figures out the kernel size for you.
So going back to the question: fastai adds an adaptive pooling layer at the end of the convolutional layers that says “make the output 1x1”. No matter the shape of the incoming activations, the output of this layer will always be 1x1xn_filters. Since n_filters is independent of your image size, this works for any image you throw at it*.
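A quick sketch of this in PyTorch: `nn.AdaptiveAvgPool2d(1)` gives the same 1x1xn_filters output whatever spatial size the activations arrive with.

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d(1)  # "make the output 1x1", whatever the input size

# Activations with 64 filters but different spatial sizes,
# e.g. from a small image vs. a large one
small = torch.randn(1, 64, 7, 7)
large = torch.randn(1, 64, 56, 56)

print(pool(small).shape)  # torch.Size([1, 64, 1, 1])
print(pool(large).shape)  # torch.Size([1, 64, 1, 1])
```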
Now for the channels, there are some options; one of them is just replicating the single channel to the other ones. I don’t know exactly what fastai does.
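For illustration only (this is not necessarily what fastai does internally), replicating a single channel three times in raw PyTorch could look like this:

```python
import torch

gray = torch.randn(1, 1, 28, 28)      # one-channel MNIST-style batch
rgb = gray.expand(-1, 3, -1, -1)      # replicate the channel 3x (a view, no copy)

print(rgb.shape)  # torch.Size([1, 3, 28, 28])
```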
About AdaptivePooling though, I’m not sure it’s what does the trick here. AdaptivePooling is applied at the end of the network (“the head”), while the “problem” I have is with the bottom part of the network, i.e. the first layer where inputs are processed. A ResNet starts with this block:
So as you can see, it starts with a “Conv2d(3,…”
I’ll keep digging in the code to see if any of the transformation/datablock functions turn the 1-channel MNIST images into the 3-channel images that resnet expects.
OK, I have some clues. I noticed that when printing the ImageDataBunch, the images already have 3 channels (see below). So either the MNIST dataset referenced by fastai already contains 3-channel images, or the DataBlock API makes them so. Will dig more and share my findings.
OK, I found the answer to the channel question: fastai by default converts any image loaded via ImageList into an RGB image, calling the line below with convert_mode defaulting to “RGB”.
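The underlying mechanism is PIL’s `Image.convert`; a minimal sketch of what that conversion does (the fake all-black 28x28 image here is just for illustration):

```python
from PIL import Image
import numpy as np

# Simulate a single-channel 28x28 image like an MNIST digit
gray = Image.fromarray(np.zeros((28, 28), dtype=np.uint8), mode="L")

rgb = gray.convert("RGB")  # PIL duplicates the grayscale values into 3 channels
print(rgb.mode, rgb.size)  # RGB (28, 28)
```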
“yes, I get what is going on now also with the size, thanks!”
I am still confused about how the size issue is handled. I understand how the channels are handled, but where in the code is the size handled? Are we resizing the 28x28 to 224x224 before we do the Conv2d? I would certainly appreciate your advice.
Hi! Sorry for the late reply. How the size problem is solved is explained well by @lgvaz’s earlier comment on my original question. In short, size really doesn’t matter for the earlier layers of a convolutional network, as all operations are independent of size. You only have a problem once you reach the final layers of your ResNet, where you have a regular “dense” network, which is usually designed with a fixed number of input neurons. To handle multiple input sizes you can simply use an “AdaptivePooling” layer right before you connect to the regular neural network. AdaptivePooling is designed to always give a fixed-size output and to instead “adapt” the kernel size based on the image size it’s being fed.
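Putting the whole pattern together, here is a toy sketch (not the actual fastai head, just the same idea): a convolutional body followed by AdaptiveAvgPool and a dense layer with a fixed number of inputs, which then accepts both 28x28 and 224x224 images.

```python
import torch
import torch.nn as nn

# Toy "convolutional body" + adaptive-pool head, sketching the ResNet pattern
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # size-independent conv layer
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                     # fixed 1x1 output, any input size
    nn.Flatten(),
    nn.Linear(16, 10),                           # dense head, fixed 16 inputs
)

for size in (28, 224):
    x = torch.randn(1, 3, size, size)
    print(net(x).shape)  # torch.Size([1, 10]) for both sizes
```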