Hi @vrodriguezf,
This is actually a very good question! Our study group covered these chapters a few weeks ago (we’re now on the last few chapters of the book).
On your second point, about how the majority of the computation happens in the early layers: the reason is that as you progress deeper into the model, the number of channels might increase, but the grid size of the image decreases.
One way of seeing this is to compute the size of the activation maps (which factors in the grid size), since that size is roughly proportional to the amount of computation done by each layer: just multiply together the dimensions of the output shape.
If you run `learn.summary()` for a ResNet-18 on the MNIST dataset, for example, you can see that the activation maps for the early layers are much larger (because the grid size is still large) than the activation maps of the deeper layers.
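In case it helps, here is roughly how you can set that up. This is a minimal sketch, assuming fastai v2, the full 10-class MNIST dataset from `URLs.MNIST`, and a `resnet18` backbone, so treat the exact calls as illustrative rather than the exact code I ran:

```python
from fastai.vision.all import *

# Assumptions: fastai v2, the full MNIST dataset (10 classes) from URLs.MNIST,
# a standard resnet18 backbone, and a batch size of 64.
path = untar_data(URLs.MNIST)
dls = ImageDataLoaders.from_folder(path, train='training', valid='testing', bs=64)
learn = cnn_learner(dls, resnet18, metrics=accuracy)
learn.summary()  # prints each layer's type, output shape, and parameter count
```

Running that gives a summary along these lines: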
```
Sequential (Input shape: ['64 x 3 x 28 x 28'])
================================================================
Layer (type) Output Shape Param # Trainable
================================================================
Conv2d 64 x 64 x 14 x 14 9,408 True
________________________________________________________________
BatchNorm2d 64 x 64 x 14 x 14 128 True
________________________________________________________________
ReLU 64 x 64 x 14 x 14 0 False
________________________________________________________________
MaxPool2d 64 x 64 x 7 x 7 0 False
________________________________________________________________
Conv2d 64 x 64 x 7 x 7 36,864 True
________________________________________________________________
BatchNorm2d 64 x 64 x 7 x 7 128 True
________________________________________________________________
ReLU 64 x 64 x 7 x 7 0 False
________________________________________________________________
Conv2d 64 x 64 x 7 x 7 36,864 True
________________________________________________________________
BatchNorm2d 64 x 64 x 7 x 7 128 True
________________________________________________________________
Conv2d 64 x 64 x 7 x 7 36,864 True
________________________________________________________________
BatchNorm2d 64 x 64 x 7 x 7 128 True
________________________________________________________________
ReLU 64 x 64 x 7 x 7 0 False
________________________________________________________________
Conv2d 64 x 64 x 7 x 7 36,864 True
________________________________________________________________
BatchNorm2d 64 x 64 x 7 x 7 128 True
________________________________________________________________
Conv2d 64 x 128 x 4 x 4 73,728 True
________________________________________________________________
BatchNorm2d 64 x 128 x 4 x 4 256 True
________________________________________________________________
ReLU 64 x 128 x 4 x 4 0 False
________________________________________________________________
Conv2d 64 x 128 x 4 x 4 147,456 True
________________________________________________________________
BatchNorm2d 64 x 128 x 4 x 4 256 True
________________________________________________________________
Conv2d 64 x 128 x 4 x 4 8,192 True
________________________________________________________________
BatchNorm2d 64 x 128 x 4 x 4 256 True
________________________________________________________________
Conv2d 64 x 128 x 4 x 4 147,456 True
________________________________________________________________
BatchNorm2d 64 x 128 x 4 x 4 256 True
________________________________________________________________
ReLU 64 x 128 x 4 x 4 0 False
________________________________________________________________
Conv2d 64 x 128 x 4 x 4 147,456 True
________________________________________________________________
BatchNorm2d 64 x 128 x 4 x 4 256 True
________________________________________________________________
Conv2d 64 x 256 x 2 x 2 294,912 True
________________________________________________________________
BatchNorm2d 64 x 256 x 2 x 2 512 True
________________________________________________________________
ReLU 64 x 256 x 2 x 2 0 False
________________________________________________________________
Conv2d 64 x 256 x 2 x 2 589,824 True
________________________________________________________________
BatchNorm2d 64 x 256 x 2 x 2 512 True
________________________________________________________________
Conv2d 64 x 256 x 2 x 2 32,768 True
________________________________________________________________
BatchNorm2d 64 x 256 x 2 x 2 512 True
________________________________________________________________
Conv2d 64 x 256 x 2 x 2 589,824 True
________________________________________________________________
BatchNorm2d 64 x 256 x 2 x 2 512 True
________________________________________________________________
ReLU 64 x 256 x 2 x 2 0 False
________________________________________________________________
Conv2d 64 x 256 x 2 x 2 589,824 True
________________________________________________________________
BatchNorm2d 64 x 256 x 2 x 2 512 True
________________________________________________________________
Conv2d 64 x 512 x 1 x 1 1,179,648 True
________________________________________________________________
BatchNorm2d 64 x 512 x 1 x 1 1,024 True
________________________________________________________________
ReLU 64 x 512 x 1 x 1 0 False
________________________________________________________________
Conv2d 64 x 512 x 1 x 1 2,359,296 True
________________________________________________________________
BatchNorm2d 64 x 512 x 1 x 1 1,024 True
________________________________________________________________
Conv2d 64 x 512 x 1 x 1 131,072 True
________________________________________________________________
BatchNorm2d 64 x 512 x 1 x 1 1,024 True
________________________________________________________________
Conv2d 64 x 512 x 1 x 1 2,359,296 True
________________________________________________________________
BatchNorm2d 64 x 512 x 1 x 1 1,024 True
________________________________________________________________
ReLU 64 x 512 x 1 x 1 0 False
________________________________________________________________
Conv2d 64 x 512 x 1 x 1 2,359,296 True
________________________________________________________________
BatchNorm2d 64 x 512 x 1 x 1 1,024 True
________________________________________________________________
AdaptiveAvgPool2d 64 x 512 x 1 x 1 0 False
________________________________________________________________
AdaptiveMaxPool2d 64 x 512 x 1 x 1 0 False
________________________________________________________________
Flatten 64 x 1024 0 False
________________________________________________________________
BatchNorm1d 64 x 1024 2,048 True
________________________________________________________________
Dropout 64 x 1024 0 False
________________________________________________________________
Linear 64 x 512 524,288 True
________________________________________________________________
ReLU 64 x 512 0 False
________________________________________________________________
BatchNorm1d 64 x 512 1,024 True
________________________________________________________________
Dropout 64 x 512 0 False
________________________________________________________________
Linear 64 x 10 5,120 True
________________________________________________________________
Total params: 11,708,992
Total trainable params: 11,708,992
```
You will also note that even though the number of parameters increases in the deeper layers (before being reduced down to the final number of output classes), the actual size of the activation maps grows smaller and smaller.
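To make that concrete, here is a quick back-of-the-envelope comparison using two rows from the summary above; it just multiplies the output-shape dimensions together, as I suggested earlier (the layer labels are mine, for illustration):

```python
# Activation-map size = product of the output-shape dimensions.
# Shapes below are copied from the learn.summary() output above;
# the labels are mine.
shapes = {
    "first Conv2d (early layer)": (64, 64, 14, 14),  # large grid, few channels
    "last Conv2d (deep layer)":   (64, 512, 1, 1),   # tiny grid, many channels
}
for name, shape in shapes.items():
    total = 1
    for dim in shape:
        total *= dim
    print(f"{name}: {shape} -> {total:,} activations")
# first Conv2d (early layer): (64, 64, 14, 14) -> 802,816 activations
# last Conv2d (deep layer): (64, 512, 1, 1) -> 32,768 activations
```

So the very first conv layer produces about 25x more activations than the last one, even though it has far fewer parameters (9,408 vs. 2,359,296).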
My mental picture for this is that the simple features extracted by the early layers get combined into more complex ones by the deeper layers, but the activations (or decision making) of the deeper neurons are actually fewer, because the earlier layers have already extracted the necessary features (there’s a horizontal line near the top, and a diagonal line somewhere in the middle towards the right of the image), so the deeper layers just have to decide whether this is a one or a seven. Something like that…
Hope this answers your question.
Also pinging @marii and @tyoc213 to confirm my understanding.
Best regards,
Butch