Understanding how convolution is applied to multiple channels


I am currently going through the Part 1 old lectures. In Lesson 4 (video: https://youtu.be/V2h3IOBDvrA?t=252), convolution is explained in a spreadsheet. There, Jeremy explains that for multi-channel input the filter is not a 2D matrix but rather a tensor. But in the VGG code and the Convolution2D documentation, I see it doesn’t take a tensor but a normal 2D filter, even when the image has multiple channels.

Below is my understanding of the convolution flow. I would like to know if it is correct, or if I am missing something here:

A colored image (cat or dog) will have input size 3x224x224. Here 224x224 is the pixel dimension and 3 is the number of channels: red, green, blue.

When we pass this image to a convolution layer - `Convolution2D(64, 3, 3, activation='relu')` - we are applying 64 filters, each of size 3x3, on each channel(?). So,

  • filter1 will have a 3x3 matrix run over the red channel (which is of size 224x224) and we will get an output of 224x224 (assuming zero padding)
  • Similarly, filter1 will also be run over the blue & green channels to get 2 more outputs of size 224x224 each.
  • Is my understanding correct so far? If so, will the outputs of all 3 channels somehow be merged into a single 224x224 convolution? Because, when we apply 64 filters, the output for each colored input image is 64 convolutions according to the Convolution2D documentation.
  • If the above understanding is correct, then the same process is repeated for filter2 through filter64 to get a total of 64 convolutions, which would match the documentation.

Please let me know if this is how convolution works, and if not, what I am missing here.


If you have 3 channels, a 3x3 filter is effectively a 3x3x3 kernel, so there are 27 trainable weights plus a bias. An application of a single kernel to all channels produces a single feature map. If you perform this operation with 10 different trainable kernels, you will get 10 feature maps.
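To make the "one kernel over all channels produces one feature map" point concrete, here is a minimal naive sketch in numpy (no padding or stride, small toy sizes; function and variable names are my own, not from the lecture or Keras):

```python
import numpy as np

def conv2d_multichannel(image, kernels, biases):
    """Naive 'valid' convolution.
    image:   (C, H, W)       - C input channels
    kernels: (K, C, kh, kw)  - K kernels, each spanning all C channels
    Each kernel slides over every channel at once; the per-channel
    products are summed into ONE number per position, so each kernel
    yields a single feature map."""
    K, C, kh, kw = kernels.shape
    _, H, W = image.shape
    out = np.zeros((K, H - kh + 1, W - kw + 1))
    for k in range(K):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                patch = image[:, i:i + kh, j:j + kw]  # (C, kh, kw) window
                out[k, i, j] = np.sum(patch * kernels[k]) + biases[k]
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((3, 8, 8))          # 3-channel toy "image"
kernels = rng.standard_normal((10, 3, 3, 3))  # 10 kernels, each 3x3x3
biases = rng.standard_normal(10)

maps = conv2d_multichannel(img, kernels, biases)
print(maps.shape)  # (10, 6, 6): 10 feature maps, one per kernel
# Each kernel has 3*3*3 = 27 trainable weights, plus 1 bias.
```

So the 3 per-channel results are not kept separate: they are summed (together with the bias) into one feature map per kernel, which is why 64 filters give 64 output channels.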

The way this is implemented in most modern libraries, AFAIK, is that the convolution operation is by default performed over all input channels. If you add another layer of 2D convolutions, whatever kernel size you pick will run over the 10 feature maps (channels) that the lower layer output. If 3x3 is your kernel size, each kernel in that layer will have 3x3x10 + 1 trainable weights.
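The parameter counting above can be sketched as a tiny helper (the layer sizes below are illustrative, not from the VGG code):

```python
def conv_params(in_channels, out_channels, kh, kw):
    """Weights per kernel = kh * kw * in_channels, plus one bias per kernel."""
    return out_channels * (kh * kw * in_channels + 1)

# First layer: 3 input channels -> 10 kernels of 3x3
print(conv_params(3, 10, 3, 3))   # 10 * (3*3*3 + 1) = 280
# Next layer: 10 input channels -> e.g. 16 kernels of 3x3
print(conv_params(10, 16, 3, 3))  # 16 * (3*3*10 + 1) = 1456
```

This matches the counts you can see in Keras's `model.summary()` for stacked convolutional layers.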

I didn’t want to mislead you, so I did some checking, and that does seem to be the case :slight_smile:


You should use the #part1 (2017) forum for the old lectures FYI.

Actually the naming between the parts has become confusing now…

Yes it is confusing - but I don’t want to break links to the old version.