Hi, so I am currently in chapter 13, on page 410, and I am a little confused.
It is mentioned that kernels passed as an input to F.conv2d need to be rank-4 tensors.
I quote from the page:
“We’ll see how to handle more than one channel later in this chapter. Kernels passed to
F.conv2d need to be rank-4 tensors:
[channels_in, features_out, rows, columns] .
edge_kernels is currently missing one of these. We need to tell PyTorch that the number of input channels in the kernel is one, which we can do by inserting an axis of size one (this is known as a unit axis) in the first location, where the PyTorch docs show in_channels is expected.”
I went to the PyTorch documentation for F.conv2d, but I am not finding anywhere that the kernel should have this “format”. Can someone help me figure out where I can find that the order of the kernel passed to F.conv2d is supposed to be [channels_in, features_out, rows, columns]?
I am also super confused by this passage in the book:
After stating [channels_in, features_out, rows, columns] as the goal order, the 1 for “channels_in” gets inserted into torch.Size([4, 3, 3]) in the second position, giving torch.Size([4, 1, 3, 3]), which is not the order [channels_in, features_out, rows, columns] that I thought it was supposed to be.
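To make the confusion concrete, here is a minimal sketch of the step the book describes. The kernel values here are made up (random stand-ins for the book's edge_kernels); only the shapes matter:

```python
import torch
import torch.nn.functional as F

# Stand-in for the book's edge_kernels: four 3x3 kernels (values made up).
edge_kernels = torch.randn(4, 3, 3)

# Insert a unit axis at index 1, where the docs expect in_channels:
# weight shape becomes (out_channels, in_channels, kH, kW).
kernels = edge_kernels.unsqueeze(1)
print(kernels.shape)  # torch.Size([4, 1, 3, 3])

# A batch of 8 single-channel 28x28 images (MNIST-like, made up here).
imgs = torch.randn(8, 1, 28, 28)
out = F.conv2d(imgs, kernels)
print(out.shape)  # torch.Size([8, 4, 26, 26])
```

This runs, which suggests [4, 1, 3, 3] (four output features, one input channel, 3x3 kernels) is the shape F.conv2d actually wants.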
The idea of conv2d in almost any deep learning framework is that your channels_in (input channels), which is basically the “thickness” of your input, has to match the features_out of your previous layer. And the way to specify the thickness of your own features_out is to “stack” more kernels on top of each other.
So you can think of [channels_in, features_out, rows, columns] as being “the depth of my current input, the depth of my output after applying the convolution operation, and the height and width of my kernel”.
In your case, since you are only dealing with a grayscale image, your channels_in will be 1. If you had a color image, your channels_in would be 3 for RGB. In grayscale you will have a kernel of size [1, k_h, k_w], where k_h and k_w are the height and width of the kernel that you specify. In PyTorch you specify this using the kernel_size=(k_h, k_w) function parameter (careful, not talking about the weight parameter here). Now, in order to specify the thickness of the output, you set features_out to whatever you like, e.g. 256. PyTorch will then stack your [1, k_h, k_w] kernel and apply the convolution operation to your input of size [1, x_h, x_w]. This results in an output shape of [features_out, x_h - k_h + 1, x_w - k_w + 1] (assuming stride 1 and padding 0).
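The shapes above can be sketched with the module API; the 256 output features and 28x28 input size are just example values, not from the book:

```python
import torch
import torch.nn as nn

# Module-API sketch: grayscale input (in_channels=1), 256 output features.
conv = nn.Conv2d(in_channels=1, out_channels=256, kernel_size=(3, 3))

x = torch.randn(1, 1, 28, 28)   # one 1-channel 28x28 image
y = conv(x)
print(y.shape)                  # torch.Size([1, 256, 26, 26]): 28 - 3 + 1 = 26
print(conv.weight.shape)        # torch.Size([256, 1, 3, 3])
```

Note that the weight tensor PyTorch builds internally has out_channels first, which is the ordering question this thread is about.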
Hope this helps you understand the main idea behind the shapes involved in a convolution.
I just realized that you are using the functional API.
Please look at the following doc example: https://pytorch.org/docs/stable/nn.functional.html#conv2d
The weight parameter is what your kernel shape has to match in order for the convolution to work. (groups in your case is 1, i.e. you don’t need to think about it.)
What I explained previously was assuming the modules API since that was the link you were referring to.
Hi mar, thank you very much for your detailed reply. You explained the respective parts very well. I didn’t really think I had a problem there, but you nonetheless made it a little clearer for me.
Unfortunately I am still confused about the order mentioned in the book: I was expecting torch.Size([1, 4, 3, 3]) instead of torch.Size([4, 1, 3, 3]), also because MNIST uses only one channel. That’s why the order in torch.Size([4, 1, 3, 3]) doesn’t really make much sense to me.
Regarding the documentation: isn’t the order listed under “weight” below “Parameters”?
So is torch.Size([4, 1, 3, 3]) equal to weight (out_channels, in_channels/groups, kH, kW)?
If so, why does the book still give the order as [channels_in, features_out, rows, columns]?
(I hope I could explain well enough what I find confusing.)
Realized now, after my post, that you also referred a little to the “weight” parameter. ^^
I am glad that I could clarify things a little for you.
In terms of what shape you need to pass as the kernel when using the functional API, it would be exactly what you wrote.
This could mean that the book is not following the documentation. That is normal, since the documentation is updated more frequently, and it is therefore what I would recommend you follow.
However, if you look at what I highlighted in the picture below, which was taken from fastbook (13_convolutions), you will notice that the dimensions match the docs.
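A quick sketch to check the documented order for yourself (all tensor values here are made up; only the shapes matter): with the docs' (out_channels, in_channels/groups, kH, kW) ordering, [4, 1, 3, 3] works on a 1-channel input, while the [1, 4, 3, 3] ordering from the book's wording would not:

```python
import torch
import torch.nn.functional as F

# Docs order: (out_channels, in_channels/groups, kH, kW).
# [4, 1, 3, 3] = 4 output features, 1 input channel (grayscale), 3x3 kernels.
weight = torch.randn(4, 1, 3, 3)
x = torch.randn(2, 1, 28, 28)            # batch of 2 grayscale images
print(F.conv2d(x, weight).shape)         # torch.Size([2, 4, 26, 26])

# [1, 4, 3, 3] would instead claim 1 output feature and 4 input channels,
# which does not match a 1-channel input and raises a RuntimeError:
bad = torch.randn(1, 4, 3, 3)
try:
    F.conv2d(x, bad)
except RuntimeError as e:
    print("shape mismatch:", e)
```

So the book's prose lists the axes in a different order than the tensor it actually builds; the tensor, torch.Size([4, 1, 3, 3]), is the one that matches the docs.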