About image size, gradcam heatmaps, and other prodigious beasts

I’d like to ask a question or two about the ‘new’ pets notebook with grad-cam heatmaps (https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson6-pets-more.ipynb).

  1. Why do you set image size to 352 (instead of the usual 224 or 299 typical of imagenet transfer learning)?

  2. I tried to train with 299 images, and as a result, the image filled just the upper-left corner of the heatmap overlay.

  3. As I got more curious, I tried and trained with 600x600 images, expecting them to be cropped (or automatically resized). But:

  • they were definitely not cropped
  • they were apparently not resized to an imagenet-compatible size (at least not in the heatmap overlay, which reported a shape of 600x600).
  • moreover, the feature maps were bigger accordingly, and this suggests the convnet worked internally with bigger images.
  • last but not least, the accuracy of the model improved substantially (hitting 0.996 vs 0.943 with 299, same dataset, kernel restarted, new learner and databunch instantiated, same number of epochs done with identical hyperparams).
  1. As you found out, you can get better accuracy with higher-resolution images. I had good results with 448 instead of 224, but memory can become an issue. I’m guessing 352 was chosen as it is 224 + 128. I believe 224 is the size originally used for resnet34 and 299 for resnet50, so those sizes may get the best initial feature match with transfer learning.
  2. You may need to change the extent in the imshow to match your image size.
  3. Not sure what you mean by trained with 600 x 600. If you specified 600 as your size then it would use the full image, but if you still had 299 then it would be resizing (down-sampling) to 299. In my own classifier I used a variety of different sized and shaped images.
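
To make point 2 concrete, here is a minimal sketch. It assumes the notebook's overlay call passes a hard-coded extent for 352-pixel images to matplotlib's imshow, which would explain why a 299 image only fills the upper-left part. The helper name heatmap_extent is hypothetical, not part of fastai or the notebook.

```python
def heatmap_extent(width, height=None):
    """Build the (left, right, bottom, top) tuple that matplotlib's imshow
    accepts as `extent`, so a low-resolution heatmap is stretched over the
    full image. Putting `height` in the bottom slot keeps the origin in the
    upper-left corner, matching image coordinates."""
    if height is None:
        height = width  # square images, as in the pets notebook
    return (0, width, height, 0)


# Hypothetical usage in the plotting cell (hm is the heatmap, ax the axes):
#   ax.imshow(hm, alpha=0.6, extent=heatmap_extent(299),
#             interpolation='bilinear', cmap='magma')
print(heatmap_extent(299))
print(heatmap_extent(600, 400))
```

With the extent derived from the size you actually trained with, the heatmap should cover the whole image instead of one corner.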

I observed something similar when I trained on the FER dataset (all images in the dataset are 48x48). When I use size=48 the network underfits even after 40 epochs and doesn’t move past 70% accuracy. But with size=224 or 299 the model hits 85% accuracy within a couple of epochs! Curious what happens here… :thinking:


Just specified 600 as size for the databunch. The original images were much bigger.

Could you elaborate a bit more about this?

More importantly, I’d like to know how the resnet block can eat up images of different size, and produce feature maps accordingly, given that the number of nodes per layer does not change… I believe there is some adaptive layer at the beginning, and I’d like to know its exact mechanics, but the rest still puzzles me.

From my understanding - which may be wrong - all images will be the same size going in - in your example 600x600. The size of the layers though will be different if the starting size is 600x600 or 224x224. You should be able to see this in learn.summary(). For example - the Pets notebook starting with 224 as the size and a batch size of 64 looks like this at the beginning:

Layer (type)         Output Shape          Param #    Trainable
Conv2d               [64, 64, 112, 112]    9408       False
BatchNorm2d          [64, 64, 112, 112]    128        True
ReLU                 [64, 64, 112, 112]    0          False
MaxPool2d            [64, 64, 56, 56]      0          False

Then near the end is down to:
Conv2d               [64, 512, 7, 7]       2359296    False
BatchNorm2d          [64, 512, 7, 7]       1024       True
AdaptiveAvgPool2d    [64, 512, 1, 1]       0          False
AdaptiveMaxPool2d    [64, 512, 1, 1]       0          False
Lambda               [64, 1024]            0          False

If I use a size of 600 (and I need to reduce the batch size to fit in my GPU’s 11GB of memory), I start like this:

Layer (type)         Output Shape          Param #    Trainable
Conv2d               [32, 64, 300, 300]    9408       False
BatchNorm2d          [32, 64, 300, 300]    128        True
ReLU                 [32, 64, 300, 300]    0          False
MaxPool2d            [32, 64, 150, 150]    0          False
Conv2d               [32, 64, 150, 150]    36864      False

And towards the end:

ReLU                 [32, 512, 19, 19]     0          False
Conv2d               [32, 512, 19, 19]     2359296    False
BatchNorm2d          [32, 512, 19, 19]     1024       True
AdaptiveAvgPool2d    [32, 512, 1, 1]       0          False
AdaptiveMaxPool2d    [32, 512, 1, 1]       0          False
Lambda               [32, 1024]            0          False

So you can see that I maintain a higher resolution (19x19 rather than 7x7) towards the end, but the final stage will always be batch size x number of classes. I stopped the listings short, but the final output shapes were [64, 37] and [32, 37].
You are still traversing the data with ‘7x7’ or ‘3x3’ (thinking 2d) filters, so you are potentially able to find the same features in your higher-resolution layers.
Does this help? Another thing to point out: although in your case you are starting with larger images, if your images are smaller than 600x600 then up-sampling will happen, which is not usually a good thing as you may detect artifacts as features.
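
The spatial sizes in the two summaries can be reproduced with a bit of arithmetic. This is only a sketch: it assumes the standard torchvision resnet34 layout (a 7x7 stride-2 stem conv with padding 3, a 3x3 stride-2 max pool with padding 1, then three stages that each halve the resolution with a 3x3 stride-2 conv with padding 1). The function names are made up for illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Output length along one spatial axis of a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet34_spatial_sizes(size):
    """Trace the feature-map side length through the resolution-changing
    layers of a resnet34 body (assumed torchvision layout)."""
    sizes = []
    size = conv_out(size, kernel=7, stride=2, padding=3)  # stem Conv2d
    sizes.append(size)
    size = conv_out(size, kernel=3, stride=2, padding=1)  # MaxPool2d
    sizes.append(size)
    # The first stage keeps the resolution; each of the next three stages
    # halves it with a stride-2 3x3 convolution.
    for _ in range(3):
        size = conv_out(size, kernel=3, stride=2, padding=1)
        sizes.append(size)
    return sizes


print(resnet34_spatial_sizes(224))  # [112, 56, 28, 14, 7]
print(resnet34_spatial_sizes(600))  # [300, 150, 75, 38, 19]
```

None of these layers stores anything that depends on the input side length, which is why the same network yields 7x7 maps for 224 inputs and 19x19 maps for 600 inputs.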


Yes, definitely. Thank you.

Now a question comes to mind. As you just pointed out, we have a model whose layers adapt (in width, i.e. number of vertices) to the image size we specify. And we need a weight for every vertex.
But we know that with pretrained models, we have a predefined set of weights, already calculated.

If I’m not missing something, this should be incompatible with what we ascertained just above. What am I not catching?



Very good point. I don’t know how the existing weights extend to the larger space. @jeremy ? What are we missing here?


If I may, I would also tag @sgugger, @stas, @cedric and @lesscomfortable about this, hoping I’m not infringing rules…

The pretrained weights you use are the weights for the convolutional kernels, so the height/width of your inputs don’t matter. Only the number of channels.
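
A toy illustration of that point, in plain Python rather than fastai or PyTorch: one fixed 3x3 kernel (standing in for a pretrained filter; the values are invented) slides over inputs of two different sizes. The weight count stays at 9, only the output map grows, and a global average pool (the job AdaptiveAvgPool2d does in the head) squeezes either map to a single number.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def global_avg_pool(fmap):
    """Average a whole feature map down to one value, whatever its size."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


# Toy 3x3 edge filter: 9 weights, independent of the input size.
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

# Two inputs of different size holding the same pattern (a horizontal ramp).
small = [[float(c) for c in range(8)] for _ in range(8)]
large = [[float(c) for c in range(20)] for _ in range(20)]

for image in (small, large):
    fmap = conv2d_valid(image, kernel)
    print(len(image), "->", len(fmap), "x", len(fmap[0]),
          "pooled:", global_avg_pool(fmap))
```

The feature map is 6x6 for the small input and 18x18 for the large one, yet the same nine weights produce the same response to the same pattern, and the pooled value handed to the next layer has the same shape in both cases.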


Thanks Karl - and after thinking about it I realised that must be the case - nothing gets remembered for the image as a whole.

You are, 4x so. I was reading this thread and about to respond when I got to the bottom and saw this - but now instead I’ll point out that tagging us on threads which are not specific to us, if done by everyone, would bury us in notifications and cause us to mute notifications, which would mean that no-one could then usefully tag us in the future. That is why the etiquette section of the FAQ explicitly requests that you don’t do this.

Sorry then. Won’t happen again.

Done in the spirit of “I’d like to hear your opinion since it is generally enlightening”.

Thanks! In hindsight, it does make a lot of sense…

An interesting note from Stanford’s CS231N:

Constraints from pretrained models. Note that if you wish to use a pretrained network, you may be slightly constrained in terms of the architecture you can use for your new dataset. For example, you can’t arbitrarily take out Conv layers from the pretrained network.
However, some changes are straight-forward: Due to parameter sharing, you can easily run a pretrained network on images of different spatial size. This is clearly evident in the case of Conv/Pool layers because their forward function is independent of the input volume spatial size (as long as the strides “fit”). In case of FC layers, this still holds true because FC layers can be converted to a Convolutional Layer.
For example, in an AlexNet, the final pooling volume before the first FC layer is of size [6x6x512]. Therefore, the FC layer looking at this volume is equivalent to having a Convolutional Layer that has receptive field size 6x6, and is applied with padding of 0.

If you consider this along with the brilliant observation by @KarlH, I think we got the big picture!
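
The FC-to-conv remark in the CS231N note can be checked on a toy case. The sketch below simplifies to a single channel and a single output unit, with made-up integer weights: an FC unit over a flattened 6x6 volume gives the same number as a 6x6 convolution with no padding over that volume, and the convolutional form also runs on a larger volume, where it simply yields a grid of outputs.

```python
def fc_forward(volume, weights):
    """One fully connected unit: dot product with the flattened volume."""
    flat = [v for row in volume for v in row]
    return sum(x * w for x, w in zip(flat, weights))


def conv_valid(volume, kernel):
    """Convolution with stride 1 and padding 0 over a 2D volume."""
    k = len(kernel)
    out_h = len(volume) - k + 1
    out_w = len(volume[0]) - k + 1
    return [
        [
            sum(volume[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical FC weights for a 6x6 single-channel volume (36 values).
fc_weights = list(range(36))
# The same weights reshaped row by row into a 6x6 convolution kernel.
kernel = [fc_weights[r * 6:(r + 1) * 6] for r in range(6)]

volume = [[(r + c) % 5 for c in range(6)] for r in range(6)]
print(fc_forward(volume, fc_weights) == conv_valid(volume, kernel)[0][0])

# On a larger 8x8 volume the FC unit no longer fits, but the conv form does.
bigger = [[(r * c) % 7 for c in range(8)] for r in range(8)]
out = conv_valid(bigger, kernel)
print(len(out), "x", len(out[0]))
```

On the 6x6 volume the two forms agree exactly; on the 8x8 volume the converted layer produces a 3x3 map, one value per 6x6 window.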