How Convolutions Work: A Mini-Review
Consider the model specified by this call to get_cnn_model():
def get_cnn_model(data):
    return nn.Sequential(
        Lambda(mnist_resize),                                  # 28x28 input layer
        nn.Conv2d( 1,  8, 5, padding=2, stride=2), nn.ReLU(),  # 8x14x14 first convolution layer
        nn.Conv2d( 8, 16, 3, padding=1, stride=2), nn.ReLU(),  # 16x7x7 second convolution layer
        nn.Conv2d(16, 32, 3, padding=1, stride=2), nn.ReLU(),  # 32x4x4 third convolution layer
        nn.Conv2d(32, 32, 3, padding=1, stride=2), nn.ReLU(),  # 32x2x2 fourth convolution layer
        nn.AdaptiveAvgPool2d(1),                               # 32x1x1 adaptive average pooling layer
        Lambda(flatten),                                       # 32x1 flatten
        nn.Linear(32, data.c),                                 # 10x1 output layer
    )
Let’s consider the first convolution layer. The arguments to nn.Conv2d specify 1 input channel, 8 output channels, a 5\times5 kernel, zero-padding of 2 pixels, and a stride of 2. The input image is 28\times28, and with 2 pixels of padding on each side the image array becomes 32\times32.
If we line up a 5\times5 kernel at an initial position at the top left of this image, then column 5 of the kernel lines up with column 5 of the image, and row 5 of the kernel lines up with row 5 of the image. Sliding the kernel rightwards across the image in steps (the stride) of 2 pixels until there is no more room to slide yields a total of (32-5)//2 + 1 = 14 kernel positions, including the initial one. This is the size of the output matrix of the convolution.
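As a quick sanity check (a minimal sketch, assuming the usual import torch and import torch.nn as nn), we can run a dummy 28\times28 image through the first convolution layer and confirm the 14\times14 output:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)                    # a batch of one single-channel 28x28 image
conv1 = nn.Conv2d(1, 8, 5, padding=2, stride=2)  # the first convolution layer from the model
print(conv1(x).shape)                            # torch.Size([1, 8, 14, 14])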
We can generalize this formula for the convolution output size: when a (square) convolution kernel of size n_{kernel} is applied to a (square) input array of size n_{in}, using padding of n_{pad} pixels and a stride of n_{stride} pixels, the size of the (square) output matrix is
n_{out} = (n_{in} + 2\times n_{pad} - n_{kernel}) // n_{stride} + 1,
where // represents the floor division operation.
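As an illustration, here is the formula as a small Python helper (the name conv_out_size is ours, not part of the model code), applied to each convolution layer of the model in turn:

def conv_out_size(n_in, n_kernel, n_pad, n_stride):
    # output size of a square convolution, per the formula above
    return (n_in + 2 * n_pad - n_kernel) // n_stride + 1

n = 28
for n_kernel, n_pad, n_stride in [(5, 2, 2), (3, 1, 2), (3, 1, 2), (3, 1, 2)]:
    n = conv_out_size(n, n_kernel, n_pad, n_stride)
    print(n)                                  # prints 14, then 7, 4, 2

The printed sizes agree with the layer comments in get_cnn_model().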
By the same reasoning, since the image is square, sliding the kernel downwards from its initial position in steps of 2 pixels again yields 14 kernel positions, so the output of this convolution is a 14\times14 matrix.
At each of the 14\times14 kernel positions, the kernel is ‘applied’ to the overlapping 5\times5 portion of the image: each pixel value is multiplied by the corresponding kernel value, and the 5\times5=25 products are summed to a single scalar, which becomes the output pixel value at that position.
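Written out by hand for a single-channel image (an illustrative sketch that ignores the bias term), the output pixel at one kernel position is just the elementwise product of the kernel with the overlapping patch, summed:

import torch
import torch.nn.functional as F

img = torch.randn(32, 32)                     # the zero-padded 32x32 image
kernel = torch.randn(5, 5)                    # one 5x5 kernel

i, j = 3, 7                                   # an arbitrary kernel position
window = img[2*i : 2*i + 5, 2*j : 2*j + 5]    # overlapping 5x5 patch (stride 2)
out_ij = (window * kernel).sum()              # 25 products summed to one scalar

# agrees with PyTorch's conv2d (img is already padded, so no padding argument)
out = F.conv2d(img[None, None], kernel[None, None], stride=2)
assert torch.isclose(out[0, 0, i, j], out_ij)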
How many weights do we need to fit in this layer? Each of the 8 output channels has its own kernel, and a kernel spans all of the input channels, so here each kernel holds 5\times5\times1 = 25 weights (there is only 1 input channel). Altogether there are 5\times5\times1\times8 = 200 weights to determine for the first layer, ignoring the 8 bias parameters.
The output of the first layer is a stack of 8 channels of 14\times14 arrays, that is, an 8\times14\times14 array with 1568 entries.
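Both counts are easy to confirm from the layer itself (a quick check in PyTorch; note that the weight tensor is stored as output channels \times input channels \times height \times width):

conv1 = nn.Conv2d(1, 8, 5, padding=2, stride=2)
print(conv1.weight.shape)                        # torch.Size([8, 1, 5, 5])
print(conv1.weight.numel())                      # 200
print(conv1.bias.numel())                        # 8 bias values, ignored in our count
print(conv1(torch.randn(1, 1, 28, 28)).numel())  # 1568 output entries (8x14x14)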
How many parameters are needed for this model? Start with the convolution layers, remembering that each output channel's kernel spans all of the input channels: 5\times5\times1\times8 = 200 for the first, 3\times3\times8\times16 = 1152 for the second, 3\times3\times16\times32 = 4608 for the third, and 3\times3\times32\times32 = 9216 for the fourth. The total for the convolution layers is 200 + 1152 + 4608 + 9216 = 15176 parameters. The final fully connected linear layer maps the vector of 32 values returned by the AdaptiveAvgPool2d layer to the vector of 10 output values. This mapping requires a 32\times10 matrix, which has 320 parameters. The AdaptiveAvgPool2d and Lambda(flatten) layers have no learnable parameters. So the model has a total of 15176 + 320 = 15496 parameters to determine. Recall that for this calculation we have ignored the bias parameters. The number of input data points for each example is 28\times28=784, and there are over 50,000 training examples, so the number of data points (around 39 million) is vastly larger than the number of parameters!
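We can check the total by counting weights over the layers that have them (a minimal sketch; the Lambda and pooling layers contribute nothing, biases are again excluded, and data.c is assumed to be 10):

layers = [
    nn.Conv2d( 1,  8, 5, padding=2, stride=2),
    nn.Conv2d( 8, 16, 3, padding=1, stride=2),
    nn.Conv2d(16, 32, 3, padding=1, stride=2),
    nn.Conv2d(32, 32, 3, padding=1, stride=2),
    nn.Linear(32, 10),                        # data.c == 10 classes assumed
]
print(sum(layer.weight.numel() for layer in layers))  # 15496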
Now test your understanding by working out the sizes of the remaining layers and checking your answers against the comments in get_cnn_model()!