How do ResNets run on different sized images if they have linear layers?

I had to write a resnet from scratch recently and it’s got me wondering: if the classification layer of say, a resnet34 is a linear layer with a fixed input size, how is the model able to read images of different sizes?

Fully-convolutional models make sense. But the resnet in an alphazero model needs an x.view(-1, 19*19*2) operation in its forward method when it feeds the result of its conv stem to the nn.Linear(19*19*2, n_classes) layer in one of its classifier heads. … But if I’m running this on a board that’s not 19x19, this should break…
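Here’s a toy sketch of what I mean, with made-up layer sizes rather than the real AlphaZero head:

```python
import torch
import torch.nn as nn

# toy stand-in for the head: a 1x1 conv producing 2 planes,
# followed by a linear layer hard-coded to a 19x19 board
conv = nn.Conv2d(64, 2, kernel_size=1)
fc = nn.Linear(19 * 19 * 2, 10)

x19 = torch.randn(1, 64, 19, 19)
print(fc(conv(x19).view(-1, 19 * 19 * 2)).shape)  # torch.Size([1, 10]) -- fine

x13 = torch.randn(1, 64, 13, 13)
feats = conv(x13)                 # (1, 2, 13, 13): only 338 values per sample
fc(feats.view(-1, 19 * 19 * 2))   # RuntimeError: 338 values can't be viewed as (-1, 722)
```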

So if it won’t work here but it works on imagery in general, what am I missing? Or maybe I’ve forgotten something and it’s like the old VGG19 days where you needed a specific input size – but that doesn’t make sense, because progressive resizing is a thing.

I’ve been trying to dissect how pytorch’s resnet works but haven’t found an answer yet.


update: oh, so convnets / CNNs with linear layers can only run on data of specific dimensions [WAndB article]. Guess the shaping of different-sized images is handled by the dataloader – I didn’t realize modern CNNs aren’t all FCNs. It’d be great if someone could confirm this; that’s what it looks like.

2 Likes

Just before the fully connected layer, at line 205 in the link, you can see there is an AdaptiveAvgPool2d layer: self.avgpool = nn.AdaptiveAvgPool2d((1, 1)). Take a look at its docs here: AdaptiveAvgPool2d — PyTorch 1.12 documentation. It will resize any input size to a fixed H×W output → that’s what makes the model work with different input sizes.
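A quick way to see it (random tensors, just looking at shapes):

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d((1, 1))

# whatever spatial size the conv stages produce, the pooled output is always 1x1
for h, w in [(7, 7), (10, 10), (13, 17)]:
    x = torch.randn(2, 512, h, w)
    print(tuple(x.shape), "->", tuple(pool(x).shape))
# (2, 512, 7, 7) -> (2, 512, 1, 1)
# (2, 512, 10, 10) -> (2, 512, 1, 1)
# (2, 512, 13, 17) -> (2, 512, 1, 1)
```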

Hope it helps

4 Likes

Thanks a lot. I’m gonna think out loud & hopefully this’ll clear it up for anyone similarly confused.

So nn.AdaptiveAvgPool2d((h, w)) is going to convert the output of the previous conv layer into a (batch_size, channels, h, w) shaped tensor. But that 4D tensor isn’t directly compatible for a matmul with the (in_features, out_features) shaped linear layer: the number of columns in the input/‘feature’ tensor has to match the number of rows of the linear layer’s ‘parameter’ tensor, and here the trailing dimensions are still (h, w) rather than a single in_features axis. And that’s where line 279 comes in:

x = torch.flatten(x, 1)

which flattens everything after the batch dimension, turning each sample’s (channels, 1, 1) feature map into a vector of length channels, so the batch becomes the (batch_size, channels) matrix the linear layer expects.

  • nn.AdaptiveAvgPool2d((1, 1)) ensures each sample ends up with the correct number of elements (the last conv stage outputs as many channels as the next linear layer expects as in_features: 512 * block.expansion in ResNet),
  • and torch.flatten(x, 1) reshapes each sample’s features into a 1D vector (so the whole batch is 2D), which is now compatible for a matmul with the following linear layer (quick shape check below).
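Here’s that avgpool → flatten → linear chain as a standalone shape check (toy layers, not the full model):

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d((1, 1))
fc = nn.Linear(512, 10)           # stand-in classifier head, n_classes = 10

x = torch.randn(4, 512, 13, 9)    # pretend conv output for some odd-sized image
x = pool(x)                       # (4, 512, 1, 1)
x = torch.flatten(x, 1)           # (4, 512)
print(fc(x).shape)                # torch.Size([4, 10])
```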

aha!

So the tensor shapes we’d see, assuming the last conv output is just 512 channels:

Conv output (h_conv, w_conv are the height & width resulting from the convolutions on the original image; these vary with image size):

(bs, 512, h_conv, w_conv)

AdaptiveAvgPool output ((h, w) = (1, 1)):

(bs, 512, 1, 1)

Flatten output:

(bs, 512)

And now this is ready for a matmul with a linear layer of shape (512, n_out): across the batch that’s a (bs, 512) x (512, n_out) matmul, and since the batch just runs in parallel, each sample is a (1, 512) x (512, n_out).
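And the end-to-end check with torchvision’s resnet34 (random weights, just verifying output shapes):

```python
import torch
from torchvision.models import resnet34

model = resnet34().eval()   # untrained weights are fine for a shape check

with torch.no_grad():
    for size in (224, 320, 513):
        x = torch.randn(1, 3, size, size)
        print(size, tuple(model(x).shape))   # (1, 1000) every time
```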

1 Like

BTW the resnet chapter of the fast.ai book explains this (and every other layer of a modern resnet) in detail.

2 Likes