I noticed that the downsample method used for the ResNet networks works with stride 2 convolutions. That is fine, but what worries me is that kernel_size is set to… 1! Doesn’t that skip most of the image? My understanding is that kernel_size=1 and stride=2 samples something like the following.
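Here is a tiny sketch of my own to illustrate (plain PyTorch, not from any model code): fix the conv’s single weight to 1, and the output is just the input sampled at the even rows and columns.

```python
import torch
import torch.nn as nn

# A 1x1 stride-2 conv whose single weight is fixed to 1, so the output
# is literally the input sampled at every second position.
conv = nn.Conv2d(1, 1, kernel_size=1, stride=2, bias=False)
nn.init.ones_(conv.weight)

x = torch.arange(16.0).reshape(1, 1, 4, 4)
with torch.no_grad():
    y = conv(x)

print(x[0, 0])
# tensor([[ 0.,  1.,  2.,  3.],
#         [ 4.,  5.,  6.,  7.],
#         [ 8.,  9., 10., 11.],
#         [12., 13., 14., 15.]])
print(y[0, 0])
# tensor([[ 0.,  2.],
#         [ 8., 10.]])
```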
That stride 2 skips every second pixel would not shock me, because dimension reduction with, for example, max or average pooling also discards a lot of information (which is, in the end, what you do when you reduce the dimensions).
1x1 convolutions don’t bother me at all. Stride 2 convolutions don’t bother me. What bothers me is the combination of the two. In fact, whenever you have kernel_size < stride, you’ll have some pixels that can’t contribute anything to the final answer (unless my understanding is incorrect).
Avg pooling doesn’t discard any pixel outright, and max pooling discards the lower values, but that’s fine by me, as every pixel has an equal opportunity to be the maximum. What I don’t understand is simply discarding 3/4 of the pixels: they have no chance of ever being taken into account. Why compute them at all?
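A quick sanity check of that 3/4 figure (my own sketch, not from torchvision): see which input positions can influence the output at all by checking which of them receive a nonzero gradient.

```python
import torch
import torch.nn as nn

# With a kernel_size=1, stride=2 conv, only the even (h, w) positions are
# ever read. (With random weights, a used pixel getting an exactly-zero
# gradient is vanishingly unlikely, so nonzero gradient ~ "contributes".)
conv = nn.Conv2d(3, 8, kernel_size=1, stride=2, bias=False)

x = torch.randn(1, 3, 8, 8, requires_grad=True)
conv(x).sum().backward()

used = x.grad.abs().sum(dim=1)[0] != 0  # (h, w) positions that contributed
print(used.float().mean().item())       # 0.25 -> 3/4 of the pixels are unused
```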
I see that this comes from PyTorch: torchvision.models.resnet* does the same thing, so I’ll ask there.
Bit of an old thread. A kernel_size=1, stride=2 conv is the standard downsampling in the ResNet shortcut (projection) path. Yes, it throws away data, but so does any basic sub-sampling image downscaling algorithm.
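For reference, this is roughly what torchvision builds for that shortcut when a block changes resolution or channel count (a paraphrase of the source, with my own function name):

```python
import torch.nn as nn

def downsample(in_planes: int, out_planes: int, stride: int = 2) -> nn.Sequential:
    # The 1x1 conv does the spatial downsampling (stride 2) and the channel
    # matching in one step; with kernel_size < stride, 3 out of 4 input
    # positions are never read.
    return nn.Sequential(
        nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False),
        nn.BatchNorm2d(out_planes),
    )
```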
There are improvements on that aspect of ResNet. SENet used kernel_size=3 for the downsample. The ‘Bag of Tricks’ ResNet tweaks (ResNet-D) add an AvgPool before the 1x1: the AvgPool does the spatial downsampling and the 1x1 conv does the channel matching.
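A sketch of that ResNet-D shortcut under the same assumptions as above (the function name is mine):

```python
import torch.nn as nn

def downsample_d(in_planes: int, out_planes: int, stride: int = 2) -> nn.Sequential:
    # AvgPool handles the spatial downsampling, so every pixel contributes;
    # the 1x1 conv (now stride 1) only has to match the channel count.
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=stride, stride=stride),
        nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=1, bias=False),
        nn.BatchNorm2d(out_planes),
    )
```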
The fastai xresnet uses avg pooling in the shortcut.