Why does MNIST work with ResNet without throwing an error?

I’ve been training a model using resnet18 as a base and the MNIST dataset. I was wondering, though, why this seems to work without problems, although in theory MNIST images only have one channel and are 28x28, while the resnet input is expected to be 3 channels and 224x224.

I’ve been checking the fastai code but I don’t quite understand why this works without throwing errors. Is the input somehow “scaled” to go from 28x28x1 to 224x224x3? If so, where in the fastai code does that happen? Or have I completely misunderstood how this works?


I think the image size and the number of channels are solved in two different ways.

For the image size we use a something called AdaptivePooling at the end of the convolutional layers.

I think the best way of understanding AdaptivePooling is with an example.

Let’s say you have a 10x10 image.
What happens if you use a 3x3 filter on this image?
The resulting size is given by the formula:

new_W=((W−F+2P)/S) + 1

Where W is the image width (or height), F is the size of the kernel, P is the amount of padding, and S is the stride.

So for our case:
new_W = (10 - 3 + 0) / 1 + 1
new_W = 8
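To sanity check that (a minimal PyTorch sketch, not from the original post), you can run a 10x10 tensor through a 3x3 convolution and look at the output shape:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 10, 10)                               # one 10x10 single-channel image
conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)  # F=3, S=1, P=0
print(conv(x).shape)                                        # torch.Size([1, 1, 8, 8]) -> new_W = 8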

So what would the kernel size have to be if you want a 4x4 image as the result? Well, you just need to isolate F and solve for it. That is exactly what adaptive pooling does.

So when you use adaptive pooling instead of specifying the kernel size you specify the output size and it figures out the kernel size for you.

So going back to the question: FastAI adds an adaptive pooling layer at the end of the convolutional layers that says “make the output 1x1”. Then, no matter the shape of the incoming activations, the output of this layer will always be 1x1xn_filters. Since n_filters is independent of your image size, this works for any image you throw at it.
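Here is a minimal PyTorch sketch of that idea (using plain nn.AdaptiveAvgPool2d; the exact pooling layer fastai uses may differ): the output is 1x1 per filter no matter the spatial size of the incoming activations.

import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d(1)                # ask for a 1x1 output, whatever the input size
for size in (7, 14, 28):
    acts = torch.randn(1, 512, size, size)    # 512 filters, varying spatial size
    print(pool(acts).shape)                   # always torch.Size([1, 512, 1, 1])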

Now for the channels, there are some options; one of them is just replicating the single channel to the other ones. I don’t know exactly what fastai does.


Hi! First, thanks for taking the time to reply!

About AdaptivePooling though, I’m not sure it’s what does the trick here. AdaptivePooling is applied at the end of the network (“the head”), while the “problem” I have is with the bottom part of the network, i.e. the first layer where inputs are processed. A ResNet starts with this block:

(0): Sequential(
(0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(4): Sequential ....

So as you can see it starts with “Conv2d(3, …”.
I’ll keep digging in the code to see if any of the transformation/datablock functions turn the 1-channel MNIST images into the 3-channel images that resnet expects.
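To illustrate what I mean, here is a minimal sketch using torchvision’s resnet18 directly (an assumption on my part, since fastai wraps it): the first conv is fine with any spatial size, but it is strict about the 3 input channels declared in Conv2d(3, 64, …).

import torch
from torchvision.models import resnet18

stem = resnet18().conv1                           # Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), ...)
print(stem(torch.randn(1, 3, 28, 28)).shape)      # works on 28x28 input: torch.Size([1, 64, 14, 14])
try:
    stem(torch.randn(1, 1, 28, 28))               # single-channel input
except RuntimeError as e:
    print("fails:", e)                            # channel mismatch raises an error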

Ok, I have some clues. I noticed that when printing the ImageDataBunch the images already have 3 channels (see below). So either the MNIST images referenced by fastai are already 3-channel images or the DataBlock API makes them so. Will dig more and share my findings.

ImageDataBunch;

Train: LabelList (60000 items)
x: ImageList
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
y: CategoryList
4,4,4,4,4
Path: /root/.fastai/data/mnist_png;

Valid: LabelList (10000 items)
x: ImageList
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
y: CategoryList
4,4,4,4,4
Path: /root/.fastai/data/mnist_png;

Test: None

OK, I found the answer to the channel question: by default, fastai converts any image loaded via ImageList into an RGB image, calling the line below with convert_mode defaulting to “RGB”:

PIL.Image.open(fn).convert(convert_mode)

See the documentation for open: https://docs.fast.ai/vision.data.html#ImageList.open
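As a minimal sketch of what that conversion does (the file name mnist_digit.png is just a made-up example), the single grayscale channel simply gets copied into R, G and B:

import numpy as np
import PIL.Image

img = PIL.Image.open('mnist_digit.png')            # MNIST images open in mode 'L' (grayscale)
rgb = img.convert('RGB')                           # same pixel values, repeated over 3 channels
print(np.array(img).shape, np.array(rgb).shape)    # (28, 28) vs (28, 28, 3)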


I’m happy to know you figured out the channel trick. Did you also understand why AdaptivePooling solves the problem with the image size?

In your first response it seemed you were still not convinced by that.

Yes, I get what is going on with the size now too, thanks!


Hi,
You ended this conversation with:

“yes, I get what is going on now also with the size, thanks!”

I am still confused about how the size issue is handled. I understand how the channels are handled, but where in the code is the size handled? Are we resizing the 28x28 to 224x224 before we do the Conv2d? I would certainly appreciate your advice.

rgds
bigThrum

After further observation of lesson1-pets.ipynb (just after “Other data formats”), 2 more questions pop up.

In:
tfms = get_transforms(do_flip=False)
data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=26)
Why is size = 26 and not 28 (the side length)?

and

If I resize the image to size = 224, I get a massive increase in accuracy:

epoch  train_loss  valid_loss  accuracy  time
1      0.029826    0.005539    0.998037  00:16

as opposed to:

epoch  train_loss  valid_loss  accuracy  time
1      0.124469    0.053129    0.983317  00:03

with size = 26.

Shining any light on this would greatly aid my understanding.

thanks
bigThrum

Hi! Sorry for the late reply. How the size problem is solved is explained well in @lgvaz’s earlier comment on my original question. In short, size really doesn’t matter for the earlier layers of a convolutional network, since all the operations there are independent of size. You only have a problem once you reach the final layers of your ResNet, where you have a regular “dense” network, which is usually designed with a fixed number of input neurons. To handle multiple input sizes you can simply use an “AdaptivePooling” layer right before you connect to that regular neural network. AdaptivePooling is designed to always give a fixed-size output and to instead “adapt” the kernel size based on the size of the activations it’s being fed.
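Just to make that concrete, here is a minimal sketch with torchvision’s plain resnet18 (fastai builds its own head on top, so take it as an illustration rather than the exact fastai code): the pooling layer right before the fully connected layer is adaptive, so both a 28x28 and a 224x224 RGB input go through without errors.

import torch
from torchvision.models import resnet18

model = resnet18().eval()                            # eval mode so BatchNorm uses running stats
print(model.avgpool)                                 # AdaptiveAvgPool2d(output_size=(1, 1))
with torch.no_grad():
    print(model(torch.randn(1, 3, 28, 28)).shape)    # torch.Size([1, 1000])
    print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])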

I hope this answers your question!

I’m not sure why size here is 26 and not 28 (maybe a typo), but as discussed above, the size doesn’t matter much in terms of making the network “work”.

I’m not sure I can provide any more insight into your question on accuracy.

giuseppe
many thanks, very helpful.