How do ResNets run on different sized images if they have linear layers?

I had to write a resnet from scratch recently and it’s got me wondering: if the classification layer of, say, a resnet34 is a linear layer with a fixed input size, how is the model able to read images of different sizes?

Fully-convolutional models make sense. But the resnet in an alphazero model needs an x.view(-1,19*19*2) operation in its forward method when it feeds the result of its conv-stem to the nn.Linear(19*19*2, n_classes) linear layer in one of its classifier heads. … But if I’m running this on a board that’s not 19x19 this should break…
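To make the breakage concrete, here’s a minimal sketch (not the actual AlphaZero code — `head` and the 2-plane input are just stand-ins for the classifier head described above):

```python
import torch
import torch.nn as nn

# Hypothetical AlphaZero-style head: a fixed-size linear layer fed by x.view(...)
head = nn.Linear(19 * 19 * 2, 10)

x19 = torch.randn(1, 2, 19, 19)           # 19x19 board, 2 planes
out = head(x19.view(-1, 19 * 19 * 2))     # works: (1, 722) @ (722, 10)
print(out.shape)                          # torch.Size([1, 10])

x9 = torch.randn(1, 2, 9, 9)              # 9x9 board: only 162 elements
try:
    head(x9.view(-1, 19 * 19 * 2))        # the view itself fails: 162 != 722
except RuntimeError as e:
    print("breaks:", e)
```

So the hard-coded `view` + `nn.Linear` pair really does tie this head to one board size.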

So if it won’t work here but it works on imagery in general, what am I missing? Or I forgot something and it’s like the old VGG19 days where you need a specific input size – but that doesn’t make sense because progressive resizing is a thing.

I’ve been trying to dissect how pytorch’s resnet works but haven’t found an answer yet.

update: oh, so convnets / CNNs with linear layers can only run on data of specific dimensions [WAndB article]. I guess the shaping of different-sized images is handled by the dataloader – didn’t realize modern CNNs aren’t all FCNs. It’d be great if someone could confirm this, because that’s what it looks like.


Just before the fully connected layer, at line 205 in the link, you can see there is an AdaptiveAvgPool2d layer: self.avgpool = nn.AdaptiveAvgPool2d((1, 1)). Take a look at its doc here: AdaptiveAvgPool2d — PyTorch 1.12 documentation. It will pool any input size down to a fixed H×W output, and that’s what makes the model work with different input sizes.
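A quick sketch of what that layer does (channel count 512 chosen to match resnet34’s final stage):

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d((1, 1))

# Two different spatial sizes, same channel count:
small = torch.randn(1, 512, 7, 7)
large = torch.randn(1, 512, 24, 24)

print(pool(small).shape)  # torch.Size([1, 512, 1, 1])
print(pool(large).shape)  # torch.Size([1, 512, 1, 1])
```

Whatever spatial size the conv stack produces, the pooled output is always (1, 512, 1, 1).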

Hope it helps


Thanks a lot. I’m gonna think out loud & hopefully this’ll clear it up for anyone similarly confused.

So nn.AdaptiveAvgPool2d((h, w)) converts the output of the previous conv layer into a (batch_size, channels, h, w) shaped tensor. But that 4D tensor still isn’t compatible for a matmul with the (in_features, out_features) shaped linear layer – the last dimension of the input/‘feature’ tensor has to match the number of rows in the linear layer’s ‘parameter’ tensor, and right now the features are spread across the channel dim and two trailing spatial dims. And that’s where line 279 comes in:

x = torch.flatten(x, 1)

which flattens everything from dim 1 onward, turning each sample’s (channels, 1, 1) activations into the 1D feature vector the linear layer expects.

  • AdaptiveAvgPool2d((1, 1)) ensures the tensor has the correct number of elements regardless of input size (the last conv stage outputs as many channels as the linear layer expects as features: 512*block.expansion in ResNet),
  • and torch.flatten(x, 1) reshapes each sample into a 1D vector, which is now compatible for a matmul with the following linear layer.
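Those two steps together, as a minimal sketch:

```python
import torch

x = torch.randn(4, 512, 1, 1)   # pooled output: (bs, channels, 1, 1)
flat = torch.flatten(x, 1)      # flatten everything from dim 1 onward
print(flat.shape)               # torch.Size([4, 512])

fc = torch.nn.Linear(512, 10)   # (in_features=512, out_features=10)
print(fc(flat).shape)           # torch.Size([4, 10])
```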


So the tensor shapes we’d see, assuming the last conv output is just 512 channels:

Conv output (h_conv,w_conv are height & width resulting from convolutions on the original image; this varies with image size):

(bs, 512, h_conv, w_conv)

AdaptAvgPool output ((h,w) = (1,1))

(bs, 512, 1, 1)

Flatten output:

(bs, 512)

And now this is ready for a matmul with a linear layer of shape (512, n_out). Since the whole batch runs in parallel, the actual matmul is (bs, 512) x (512, n_out), giving a (bs, n_out) output.
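Putting the whole shape trace together – this is a toy stand-in for the real ResNet (one conv instead of the residual stages, names like TinyHead are made up), just to show the output shape is size-independent:

```python
import torch
import torch.nn as nn

# Minimal sketch of a ResNet-style tail: conv stage -> adaptive pool -> flatten -> fc.
class TinyHead(nn.Module):
    def __init__(self, n_out=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 512, 3, stride=2, padding=1)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, n_out)

    def forward(self, x):
        x = self.conv(x)          # (bs, 512, h_conv, w_conv) -- varies with input size
        x = self.avgpool(x)       # (bs, 512, 1, 1) -- fixed, whatever the input size
        x = torch.flatten(x, 1)   # (bs, 512)
        return self.fc(x)         # (bs, n_out)

m = TinyHead()
print(m(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 10])
print(m(torch.randn(2, 3, 160, 160)).shape)  # torch.Size([2, 10])
```

Two different image sizes, same classifier output shape – which is exactly why torchvision’s resnets don’t need a fixed input size.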


BTW the resnet chapter of the book explains this (and every other layer of a modern resnet) in detail.