Lesson 3: Continue training 128x128 learner on 256x256 input -- how?

At https://github.com/hiromis/notes/blob/master/Lesson3.md#making-the-model-better-5030

A learner is initially trained on 128x128 images, but then trained on 256x256 images after that. Why is this possible? Aren't the dimensions different?

It's possible because modern architectures aren't bound to a specific input size, so you can pass almost any image size (very small images, below roughly 32px, can cause problems).
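You can check this directly with torchvision (resnet34 is just an example here; the point is the adaptive pooling layer it ends in):

import torch
from torchvision import models

m = models.resnet34(num_classes=10)
print(m.avgpool)  # AdaptiveAvgPool2d(output_size=(1, 1)) in recent torchvision -- this is why the size can change
print(m(torch.randn(2, 3, 128, 128)).shape)  # torch.Size([2, 10])
print(m(torch.randn(2, 3, 256, 256)).shape)  # torch.Size([2, 10])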

Finally, this training trick is called progressive resizing and helps to get more accurate results. Watch Jeremy's course v3 videos; he explains this and many more things in the course.
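The pattern itself needs nothing special from the model. Here's a minimal sketch with random tensors standing in for the real DataLoaders (the fastai calls in the lesson differ; this just shows that only the data size changes between the two phases):

import torch
from torch import nn, optim
from torchvision import models

model = models.resnet34(num_classes=10)
opt = optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# progressive resizing: same model, same optimizer, only the image size changes
for size, steps in [(128, 3), (256, 3)]:
    for _ in range(steps):
        x = torch.randn(8, 3, size, size)   # stand-in for a batch of images
        y = torch.randint(0, 10, (8,))      # stand-in for labels
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"finished a pass at {size}x{size}")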


I had the same question and did some experiments :slight_smile: here's what I found (copied from the 2020 course forum):

I had a hard time understanding the different input shapes - and why the architecture still works - too :slight_smile:. I think I figured out how it works - if I'm wrong please let me know :smiley:

I built a Conv1d ResNet for inputs like audio files and first had the problem that, when the length of the input (audio file) changed, I had to adapt the network architecture. In one of Jeremy's other lessons (the GAN one from 2018, I think) he mentioned the difference between the pooling layers (average pooling vs. adaptive average pooling especially). Using AdaptiveAvgPool*d does the trick :slight_smile: .
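To make the "adaptive" part concrete, here is a tiny contrast (shapes only, the values don't matter):

import torch
import torch.nn as nn

x_long  = torch.randn(64, 512, 100)
x_short = torch.randn(64, 512, 77)

fixed    = nn.AvgPool1d(kernel_size=4)   # output length depends on the input length
adaptive = nn.AdaptiveAvgPool1d(1)       # output length is always 1, whatever the input length

print(fixed(x_long).shape, fixed(x_short).shape)        # [64, 512, 25] vs. [64, 512, 19]
print(adaptive(x_long).shape, adaptive(x_short).shape)  # [64, 512, 1] and [64, 512, 1]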

Here's an easy example for Conv1d with 1 input dimension (mono audio). I printed the shapes after each layer / resnet block to understand how the shapes change. This example is a lot easier to understand than Conv2d with 3 input dimensions (RGB images), because the tensors have just 3 dimensions [batch_size, number_output_kernels, "length"] instead of … a lot of dimensions :wink:

Here's a 1-second 24414 Hz mono audio clip [bs 64, 1 input channel, length: 24414 samples]:
in torch.Size([64, 1, 24414])
in conv torch.Size([64, 16, 24414])
in bn torch.Size([64, 16, 24414])
l1 torch.Size([64, 16, 24414])
l2 torch.Size([64, 32, 6104])
l3 torch.Size([64, 64, 1526])
l4 torch.Size([64, 128, 382])
l5 torch.Size([64, 256, 96])
l6 torch.Size([64, 512, 24])
avgpool torch.Size([64, 512, 1])

You can see that the number of kernels increases (16 to 512) - that's defined by your network architecture:

        # conv1x3: helper for nn.Conv1d with kernel_size=3 (padding=1, judging by the preserved length above)
        self.conv = conv1x3(1, 16)
        self.bn = nn.BatchNorm1d(16)
        self.relu = nn.ReLU(inplace=True)
        # ResNet-style stages: each stacks `block` (a 1d residual block);
        # layer2-layer6 downsample via their stride, layer1 keeps the length
        self.layer1 = self._make_layer(block, 16, kernel_size)
        self.layer2 = self._make_layer(block, 32, kernel_size, stride)
        self.layer3 = self._make_layer(block, 64, kernel_size, stride)
        self.layer4 = self._make_layer(block, 128, kernel_size, stride)
        self.layer5 = self._make_layer(block, 256, kernel_size, stride)
        self.layer6 = self._make_layer(block, 512, kernel_size, stride)
        # adaptive pooling reduces the last dimension to 1, whatever the input length
        self.avg_pool = nn.AdaptiveAvgPool1d(1)

The last dimension changes according to the length / size of the input (from 24414 to 24 in the first example).
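The exact lengths follow from PyTorch's Conv1d output-size formula. Assuming kernel_size=3, padding=1 and stride=4 in the downsampling layers (one combination that is consistent with the shapes above):

# L_out = floor((L_in + 2*padding - kernel_size) / stride) + 1
l_in = 24414
l_out = (l_in + 2 * 1 - 3) // 4 + 1
print(l_out)  # 6104 -- matches the l2 shape above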

Now here's a 0.75-second audio clip with 18310 samples.

in torch.Size([64, 1, 18310])
in conv torch.Size([64, 16, 18310])
in bn torch.Size([64, 16, 18310])
l1 torch.Size([64, 16, 18310])
l2 torch.Size([64, 32, 4578])
l3 torch.Size([64, 64, 1145])
l4 torch.Size([64, 128, 287])
l5 torch.Size([64, 256, 72])
l6 torch.Size([64, 512, 18])
avgpool torch.Size([64, 512, 1])

You can see that the last dimension has a size of 18 here.

Now the trick: adaptive average pooling calculates the average of the tensor over the last dimension (independent of its size!) and reduces it to 1 (nn.AdaptiveAvgPool1d(1) as specified here). That's why the input shape doesn't matter for (this kind of) CNNs.
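You can verify this with the two l6 outputs from above - nn.AdaptiveAvgPool1d(1) is just the mean over the last dimension:

import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool1d(1)
for length in (24, 18):                    # the two l6 lengths from the examples above
    x = torch.randn(64, 512, length)
    out = pool(x)
    print(out.shape)                       # torch.Size([64, 512, 1]) both times
    assert torch.allclose(out, x.mean(dim=-1, keepdim=True))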


I think it is also fairly common to use both average and max pooling, simply to get more information through. Though this is with images; I'm not particularly sure about audio.
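That's roughly what fastai's AdaptiveConcatPool2d does in its default head - a sketch of the idea (not fastai's exact module):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatPool2d(nn.Module):
    # concatenate adaptive average pooling and adaptive max pooling,
    # so the head gets twice as many features, still independent of the input size
    def forward(self, x):
        return torch.cat([F.adaptive_avg_pool2d(x, 1),
                          F.adaptive_max_pool2d(x, 1)], dim=1)

x = torch.randn(64, 512, 8, 8)
print(ConcatPool2d()(x).shape)  # torch.Size([64, 1024, 1, 1])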

Thanks for the help! So from what I understand, to train on larger inputs, you need to prepend more layers to your model to downsample the input (e.g. a stride-2 conv, or pooling)?

@florianl the adaptive pooling method you mentioned above helps up/downsample various square image sizes rather than just by powers of 2

No, actually the adaptive pooling enables the CNN to handle different input shapes WITHOUT any changes (no additional layers) to the network architecture. It doesn't downsample the images.

@marii yes, I just wanted to show the “adaptive” part of the pooling layers. I didn’t look into the advantages / disadvantages of max / avg or a combination.