Help with intuition on the output when stacking convolutional layers

Take for example this sample code from the keras documentation:

# apply a 3x3 convolution with 64 output filters on a 256x256 image:
from keras.models import Sequential
from keras.layers import Convolution2D

model = Sequential()
model.add(Convolution2D(64, 3, 3, border_mode='same', input_shape=(3, 256, 256)))
# now model.output_shape == (None, 64, 256, 256)

# add a 3x3 convolution on top, with 32 output filters:
model.add(Convolution2D(32, 3, 3, border_mode='same'))
# now model.output_shape == (None, 32, 256, 256)

Why is the output from the 2nd convolutional layer 32?

I know I’m missing something, but I understand from the lectures that the first convolutional layer is essentially creating 64 representations of each image (one for each of the 64 filters). Are these 64 representations each fed into the 2nd convolutional layer one at a time so that each of those images produces 32 representations of it?

Yes, that’s exactly it. Your second convolutional layer will take the 64 channels of the previous layer’s output and create a 32 channel output (i.e. 32 representations of the image). There will be 32 filters in your second layer, and each filter will be (64x3x3).

So the dimensions of the output images will be (32x256x256).

EDIT: Actually, I just realized you might be suggesting that there would be 64 x 32 output channels. That is not the case: with a normal convolutional layer, each output channel takes information from all of the channels of the previous layer’s output, so the 64 input channels collapse into each of the 32 outputs.
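To make the channel bookkeeping concrete, here is a minimal NumPy sketch of two stacked 'same'-padded 3x3 convolutions. This is a toy loop-based illustration, not how Keras computes it; the `conv2d_same` helper name and the small 8x8 input size are made up for the example:

```python
import numpy as np

# Sketch of a "same"-padded 3x3 convolution where every output filter
# spans ALL input channels (one 3x3 kernel slice per input channel).
def conv2d_same(x, w):
    """x: (in_ch, H, W), w: (out_ch, in_ch, 3, 3) -> (out_ch, H, W)."""
    out_ch = w.shape[0]
    H, W = x.shape[1], x.shape[2]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero-pad H and W for 'same' output
    out = np.zeros((out_ch, H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + 3, j:j + 3]                # (in_ch, 3, 3) window
            out[:, i, j] = np.tensordot(w, patch, axes=3)  # sum over in_ch, 3, 3
    return out

x = np.random.randn(3, 8, 8)        # a tiny 3-channel "image" (8x8 for speed)
w1 = np.random.randn(64, 3, 3, 3)   # layer 1: 64 filters, each 3x3 across 3 channels
w2 = np.random.randn(32, 64, 3, 3)  # layer 2: 32 filters, each 3x3 across 64 channels
h = conv2d_same(x, w1)
y = conv2d_same(h, w2)
print(h.shape, y.shape)  # (64, 8, 8) (32, 8, 8)
```

Note how the second layer's weights have shape (32, 64, 3, 3): each of the 32 filters sums over all 64 input channels, which is why the output has 32 channels, not 64 x 32.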

Thanks for the reply!

Your comment and the beginning of the Lesson 4 video clarified these concepts for me. In particular, it helps to think about these filters as three-dimensional objects (3x3x<# of channels in the input>) instead of merely two-dimensional 3x3 matrices.

Here is my understanding:

Layer 1
Receives as input a bunch of 256x256 images with 3 layers (the 3 RGB channels). It is helpful to think of these images as three-dimensional as well (256x256x3).

The layer is configured to apply 64 3x3 filters to these images, but again it is helpful to think about the 3x3 filters as three-dimensional (e.g., as 3x3x<# of channels from the previous output>). So in reality, each of the 64 filters is a 3x3x3 cube, i.e. a stack of three 3x3 matrices.

The three layers (matrices) that make up each filter are convolved with their corresponding channel in the input, and the results are summed to produce a new representation of the image. As there are 64 of these filters, there will be 64 representations.

The number of parameters will be 64*3*3*3 (the last 3 is the # of filters/channels from the input).

Layer 2
Receives as input a bunch of 256x256 images each with 64 layers (or representations). Again, it’s helpful to think of them as three-dimensional objects as well (256x256x64).

This layer is configured to apply 32 3x3 filters to these images, but to account for the 64 layers as input, it is better to think of these filters as being 3x3x64.

The 64 layers (matrices) that make up each of these 32 filters are convolved with their corresponding layer in the input, and the results are summed to produce a new representation of the image. As there are 32 of these filters, there will be 32 representations.

The number of parameters will be 32*3*3*64 (the 64 is the # of filters/channels from the input).
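As a quick sanity check, the two weight counts above can be verified with plain arithmetic (this ignores the bias terms, of which Keras adds one per filter):

```python
# Weight counts for the two layers described above (biases excluded).
layer1_weights = 64 * 3 * 3 * 3    # 64 filters, each 3x3 spanning 3 input channels
layer2_weights = 32 * 3 * 3 * 64   # 32 filters, each 3x3 spanning 64 input channels
print(layer1_weights, layer2_weights)  # 1728 18432
```

With the 64 and 32 biases added, Keras would report 1792 and 18464 total parameters for these layers.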

As forward/back prop happens over and over, the 64 3x3(x3) filters in Layer 1 and the 32 3x3(x64) filters in Layer 2 are trained and begin to take on different characteristics that help the model predict the correct classes.

ANOTHER QUESTION: How do the later convolutional filters “find even more complex structures of the image” (from the Lesson 3 notes)? It would seem that with the images getting downsized by all the max-pooling, what you get from these filters would be less specific and less complex. I know this isn’t the case, but I don’t know why. For example, at the end of VGG we have 7x7 images with 512 channels (representations) that can find very specific things… how is that possible given the images are only 7x7 pixels?



Usually as the resolution decreases, the depth of the layers increases, so the information content is similar. And if you have a filter that gets activated by eyes and another that looks for noses, after a pool the next filter might recognize faces. You don’t need the full resolution of the original image once you can identify features like eyes to identify a face.

There are some good visualization packages out there that can help you get a better sense, IIRC Jeremy showed this video example in one of his lectures:

There’s also this visualization package that works with Keras (looks cool but I haven’t tried it yet):

Thanks for this! I was a bit confused about it.

So that I’m clear on what is going on: are you saying that when we add a 3x3 convolutional layer, we are actually doing a 3x3xA convolution, where A is the number of filters in the previous layer? And since we are doing a 3x3xA convolution, we apply the same rules of convolution but in 3D: we multiply the elements of the input by the weights in our convolution kernel and then sum the total to produce a single number, which goes into that particular filter’s output. So each filter’s output is a 2D matrix, and when we concatenate it with the outputs of the other filters in the layer it becomes 3D again, this time with depth B, where B is the number of filters in the new layer.

Yep, that’s it, although there are variations.

F. Chollet (creator of Keras) uses an interesting one in his paper from last fall - depthwise separable convolutions (combination of 1x1 and k x k convolutions on individual input channels). So instead of m * k * k * n trainable parameters you only have (m + k * k) * n parameters (m = input channels, n = output channels, k = kernel size).

(NB There’s actually an additional parameter for number of channels in the intermediate step, although default is number of output channels. )

It’s available as a layer in TensorFlow and in Keras (with TF backend only).
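For concreteness, here is the parameter arithmetic behind that formula, plugged with the example sizes from earlier in this thread (64 input channels, 32 output filters, 3x3 kernels). This just evaluates the counts; it is not tied to any particular implementation:

```python
# Parameter-count comparison, using this post's notation:
# m = input channels, n = output channels, k = kernel size.
m, n, k = 64, 32, 3
standard = m * k * k * n      # a normal k x k convolution over all m channels
separable = (m + k * k) * n   # 1x1 pointwise (m*n) plus depthwise k x k (k*k*n)
print(standard, separable)    # 18432 2336
```

So for this layer the separable variant needs roughly an eighth of the trainable weights.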


Yes you got it!

If you think about it, it makes sense, because how else would later convolutional layers be able to learn more complex features if they were not using the full output from previous layers?

What really drove this home for me was spending some quality time watching this video and reading this paper:


I took the entire course, and I recommend you do so as well. It covers the fundamentals in great detail.

Thanks for starting this thread!!

I find convolution-intro.ipynb super helpful in understanding convolution; it should be fun to implement a two-conv-layer model from scratch like Jeremy did. You’ll really get to see how things move around.