Image size agnostic image segmentation model with FC layer

Pomo · January 31, 2019, 6:31am

I have a convolutional image segmentation model that starts with a 100x100 image and ends with 100x100 by several classes. A fully connected head layer (Linear) reduces that down to 2 classes. Suppose I now want to ‘warm up’ the model by training quickly on half-size 50x50 images. After that, I want to use what the FC layer has learned on the smaller images as a head start on the full-sized images.

Everything in the convolutional body scales to the smaller image size, but the last Linear layer does not. It takes 100^2 down to 2, but does not understand 50^2, much less how to warm start with weights trained with a smaller input. What would be a good way to handle this situation?

I have thought of using PyTorch's torch.nn.functional.interpolate to upsample 50x50 to 100x100 before the FC layer. Is this a reasonable approach? Is interpolate() even differentiable? Alternatively, can an adaptive pooling layer make an image larger?
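On the differentiability question, a minimal sketch that can be run to check it directly. The tensor shape and channel count are made up for illustration; only `torch.nn.functional.interpolate` and autograd are taken from PyTorch itself.

```python
import torch
import torch.nn.functional as F

# Hypothetical feature map from the conv body on a 50x50 input:
# (batch, channels, height, width). The 8 channels are an assumption.
features = torch.randn(1, 8, 50, 50, requires_grad=True)

# Bilinear upsampling to the 100x100 size the Linear layer was built for.
upsampled = F.interpolate(
    features, size=(100, 100), mode="bilinear", align_corners=False
)

# Backpropagate a dummy scalar loss through the upsampling step.
upsampled.sum().backward()

print(upsampled.shape)      # spatial size is now 100x100
print(features.grad.shape)  # gradients reached the 50x50 input
```

If gradients arrive at `features`, the upsampling step does not block training of the layers before it.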

Thanks for any advice, even hints.

zearo · January 31, 2019, 11:22pm

If the network is fully convolutional, it will end up being the same size at the end as it was at the start. E.g. in your example, the 50x50 image will turn into a 50x50x2 image (with two channels for the two classes) and a 100x100 image would be 100x100x2.

You need to use convolution with a kernel size of 1 and n_channels = the number of output channels you want. You do not want to flatten the representation before doing that. Hope that helps!
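A minimal sketch of that suggestion, assuming a toy two-layer body in place of the real model. `TinySegNet`, the hidden width of 16 and the 3 input channels are all hypothetical; the point is only that a kernel-size-1 convolution head has no dependence on image size.

```python
import torch
import torch.nn as nn


class TinySegNet(nn.Module):
    """Toy fully convolutional segmenter (hypothetical stand-in for the real model)."""

    def __init__(self, in_channels=3, hidden=16, n_classes=2):
        super().__init__()
        # Padding of 1 keeps the spatial size unchanged through each 3x3 conv.
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # 1x1 conv head: maps the hidden channels to class scores at every pixel.
        self.head = nn.Conv2d(hidden, n_classes, kernel_size=1)

    def forward(self, x):
        return self.head(self.body(x))


model = TinySegNet()
small = model(torch.randn(4, 3, 50, 50))    # warm-up resolution
large = model(torch.randn(4, 3, 100, 100))  # full resolution, same weights

print(small.shape)
print(large.shape)
```

Because no layer's parameter shape depends on the input height or width, weights trained on 50x50 images load unchanged for 100x100 images.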

Pomo · February 1, 2019, 6:06pm

Thanks for replying!

> If the network is fully convolutional, it will end up being the same size at the end as it was at the start. E.g. in your example, the 50x50 image will turn into a 50x50x2 image (with two channels for the two classes) and a 100x100 image would be 100x100x2.

I understand this and agree. I am using a model which outputs a feature map the same image size as the input.

> You need to use convolution with a kernel size of 1 and n_channels = the number of output channels you want. You do not want to flatten the representation before doing that.

Here’s where I get lost. Reading about 1x1xn convolutions, it seems they output a per-pixel linear combination of size n x (the input image size). But what I want to try (perhaps foolishly) is to flatten the convolutional output and apply a linear layer to every pixel activation. The issue is that with different input image sizes, that linear layer needs to have a different input size. Therefore the whole model does not adapt to different input image sizes.

Thanks for clarifying.

Pomo · February 1, 2019, 6:25pm

BTW, here’s another related question that has started bothering me while working with these models. I got the above idea from Jeremy’s lectures where he starts training with smaller, low resolution versions of the training set. Then he switches to high resolution later during training. In effect, the model is quickly pretrained, plus you get “fresh” images to avoid overfitting.

But why should it work? The lower res images are at a reduced scale. Each pretrained convolutional filter and pooling layer (resnet) will therefore cover more area in the smaller image. Every object will look to the model both lower res and smaller. For example, in low res, the model trains to recognize tiny dog-eyes. Why should we expect these weights to still be right when we switch to large dog-eyes, or larger forest textures, where the filters scan each image feature at a magnified scale?

This method certainly works empirically, but why?
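To put rough numbers on the scale concern, here is a small sketch that computes the receptive field of a stack of layers using the standard recurrence (the receptive field grows by `(kernel - 1) * jump` per layer, and the jump is multiplied by the stride). The layer list is a hypothetical ResNet-like stem chosen for illustration, not the model from this thread.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the view by (k - 1) input steps
        jump *= stride             # striding spreads later layers' taps further apart
    return rf


# Hypothetical stem: 7x7 conv stride 2, 3x3 pool stride 2, then two 3x3 convs.
stem = [(7, 2), (3, 2), (3, 1), (3, 1)]
rf = receptive_field(stem)

for size in (50, 100):
    print(f"{size}x{size} input: receptive field of {rf}px spans {rf / size:.0%} of the width")
```

The same filters cover about twice the fraction of the image at half resolution, which is the mismatch the question describes.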

zearo · February 4, 2019, 9:53am

1x1 convolutions do output a per-pixel linear combination. However, each pixel in the penultimate layer has ‘seen’ values from a wide view (the receptive field) and hence is taking spatial information into account.

In short, the flattening is not necessary because the pixels already have spatial information.
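One way to see this is that a 1x1 convolution is the same operation as a Linear layer applied to each pixel's channel vector with shared weights. A minimal sketch checking that numerically; the channel counts and the 50x50 feature map are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden, n_classes = 16, 2  # hypothetical channel counts

conv_head = nn.Conv2d(hidden, n_classes, kernel_size=1)
linear_head = nn.Linear(hidden, n_classes)

# Copy the conv weights (n_classes, hidden, 1, 1) into the Linear layer (n_classes, hidden).
with torch.no_grad():
    linear_head.weight.copy_(conv_head.weight.view(n_classes, hidden))
    linear_head.bias.copy_(conv_head.bias)

features = torch.randn(1, hidden, 50, 50)

conv_out = conv_head(features)
# Move channels last so Linear acts on each pixel's channel vector, then move them back.
linear_out = linear_head(features.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

same = torch.allclose(conv_out, linear_out, atol=1e-5)
print(same)
```

A Linear layer over the flattened map, by contrast, ties one weight to each pixel position, which is why its input size changes with the image size.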

zearo · February 4, 2019, 9:55am

My first idea would be that the ‘eye’ part of the network (normally a deep layer) will still be responsible for that at a higher resolution, but the network will adapt in its earlier layers to make sure the representation later on is the same.

I think you’re right that this could be problematic in the training, though.

Pomo · February 6, 2019, 11:12pm

Good theory. Hopefully, if I keep practicing, the reasons and the usefulness will become clearer.