Confused about the amount of compute in CNNs / ResNets

Continuing the discussion from Confusion about a concept written in fastbook (intuition of the amount computation in CNNs):

I read this thread because I had the exact same question. I’m sorry for repeating a topic, but even after the detailed answer from @butchland I’m still not feeling any wiser. Chapter 13 of the book explicitly calculates the number of multiplications for different layers and concludes:

What happened here is that our stride-2 convolution halved the grid size from 14x14 to 7x7, and we doubled the number of filters from 8 to 16, resulting in no overall change in the amount of computation.
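The book's arithmetic is easy to verify: when a stride-2 layer quarters the grid area (14x14 → 7x7) and both the input and output channel counts double relative to the previous layer, the multiplication count per layer stays constant. A quick sketch (the 4→8 input channel count for the preceding layer is my assumption, matching the book's simple CNN):

```python
def conv_mults(h_out, w_out, c_in, c_out, k=3):
    # multiplications for a k x k conv producing an h_out x w_out output grid
    return h_out * w_out * c_in * c_out * k * k

# layer producing the 14x14 grid with 8 filters (4 channels in)
layer_a = conv_mults(14, 14, 4, 8)
# next stride-2 layer: grid halved to 7x7, filters doubled to 16 (8 channels in)
layer_b = conv_mults(7, 7, 8, 16)
assert layer_a == layer_b  # both come out to 56,448 multiplications
```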

Emphasis mine. So why do we use plain convolutional layers instead of ResNet blocks as the stem of a ResNet in chapter 14?

The reason that we have a stem of plain convolutional layers, instead of ResNet blocks, is based on a very important insight about all deep convolutional neural networks: the vast majority of the computation occurs in the early layers. Therefore, we should keep the early layers as fast and simple as possible.

What am I missing here?

My other question concerns another apparent contradiction between these chapters. Chapter 13 states that we should use larger kernel sizes on the first layer when we want to create more channels. The reason:

Neural networks will only create useful features if they’re forced to do so—that is, if the number of outputs from an operation is significantly smaller than the number of inputs.

In chapter 14, however, the first layer of the stem creates 32 (!) channels from 3 input channels with a kernel size of only 3x3:

Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)

Here we are computing 32 values from just 27 inputs (a 3x3 patch across 3 channels). How is this useful?

Thank you for any thoughts :slight_smile:

I don’t think this stuff is understood very well yet.

What is the best stem to use? Everyone appears to use something different, so there doesn’t seem to be consensus.

Has there been a study where they tried different stem designs while keeping the rest of the model constant? (If so, that only tells you something about the stem for that particular architecture, so it still might not be a universal truth.)

What you call “confusion” might just be a hunch that more work needs to be done in this area…


Thank you, Matthijs! I’m halfway through your blog post about the topic, super interesting :slight_smile:

I’ll look around to see if I can find any systematic comparisons of different stems.

I would suggest making yourself some very simple examples to try the math out for yourself.
For example:

import torch
import torch.nn as nn

x = torch.rand(1,1,2,2)
conv = nn.Conv2d(1,3,kernel_size=2,stride=1, padding=0, bias=True)

How many operations are there?
mults: 4·3 = 12 (4 mults per kernel pass, 3 output channels)
sums: 3·3 + 3 = 12 (3 sums per kernel pass, 3 output channels, plus 3 bias adds)

I made a small func to compute this:

def ops(conv, x):
    "compute the multiplications and additions of conv over x"
    h,w = x.shape[2:]
    in_channels = conv.in_channels
    out_channels = conv.out_channels
    kernel_size = conv.kernel_size
    stride = conv.stride
    padding = conv.padding
    # output grid size, using the standard conv output-size formula
    y_ops = (h + 2*padding[0] - kernel_size[0])//stride[0] + 1
    x_ops = (w + 2*padding[1] - kernel_size[1])//stride[1] + 1
    mults = in_channels*out_channels*x_ops*y_ops*kernel_size[0]*kernel_size[1]
    # sums within each kernel pass, plus one add per output value if there is a bias
    n_bias = 0 if conv.bias is None else conv.bias.numel()
    sums = in_channels*out_channels*x_ops*y_ops*(kernel_size[0]*kernel_size[1]-1) + n_bias*(x_ops*y_ops)
    return mults, sums

so if you do:

x = torch.rand(1,3,4,4)
conv = nn.Conv2d(3,6,kernel_size=3,stride=2, padding=1, bias=False)
out = conv(x)
ops(conv, x)
>> (648, 576)

A 3x3 kernel (9 mults), applied at 4 output positions on each of the 3 input planes, for each of the 6 output channels: 9·4·3·6 = 648.
And then a stride-1 conv on the 2x2 output:

conv2 = nn.Conv2d(6,6,kernel_size=3,stride=1,padding=1, bias=False)
ops(conv2, out)
>> (1296, 1152)

A 3x3 kernel, again applied at 4 output positions (stride 1 with padding 1 keeps the 2x2 grid), on each of the 6 input planes, for each of the 6 output channels: 9·4·6·6 = 1296.
It is twice as heavy.


Thank you, Thomas, for this cool function. This is really useful to play around with!

But doesn’t it rather prove my doubt about the statement “the vast majority of the computation occurs in the early layers”? In your example (and in others with more realistic numbers that I tried), even when we don’t increase the number of filters in the second layer, the number of computations doubles.

The internal layers are heavy. Even on small images.
They have more params, but operate on “smaller” images.
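To see where the compute actually lands, the same multiplication count can be tallied layer by layer. Here is a sketch over a small made-up stride-2 stack (the layer sizes are illustrative, not from any particular architecture):

```python
import torch.nn as nn

def conv_mults(conv, h, w):
    """Multiplications for one conv layer on an h x w input,
    plus the output grid size (standard conv output arithmetic)."""
    kh, kw = conv.kernel_size
    oh = (h + 2*conv.padding[0] - kh) // conv.stride[0] + 1
    ow = (w + 2*conv.padding[1] - kw) // conv.stride[1] + 1
    return conv.in_channels * conv.out_channels * oh * ow * kh * kw, oh, ow

# a made-up stride-2 stack, loosely shaped like a stem plus early stages
stack = [
    nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
    nn.Conv2d(32, 64, 3, stride=2, padding=1, bias=False),
    nn.Conv2d(64, 64, 3, stride=2, padding=1, bias=False),
    nn.Conv2d(64, 128, 3, stride=2, padding=1, bias=False),
]
h = w = 224
for conv in stack:
    m, h, w = conv_mults(conv, h, w)
    print(f"{conv.in_channels:>3} -> {conv.out_channels:>3} channels, "
          f"{h}x{w} output: {m:>12,} mults")
```

With these sizes the heaviest layer is the second one, not the very first: the first conv sees the biggest grid but has only 3 input channels, so the in_channels × out_channels product dominates once the channels grow.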