Hoping to get some thoughts on this.
So we take an image with shape (224, 224, 3) (or (3, 224, 224) for Theano's channels-first ordering) and chuck it through a Conv layer with 64 filters, each 3x3x3. Assuming "same" padding was used, this leaves us with a 64x224x224 output.
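To make the shapes concrete, here's a minimal NumPy sketch of that layer (sizes scaled down from 224x224 to 8x8 purely for speed; the filter shapes and the channel arithmetic are the same):

```python
import numpy as np

# Small stand-ins for the real sizes: H=W=224, C_out=64 in the post.
H, W, C_in, C_out, K = 8, 8, 3, 64, 3

image = np.random.rand(H, W, C_in)
filters = np.random.rand(C_out, K, K, C_in)  # 64 filters, each 3x3x3

padded = np.pad(image, ((1, 1), (1, 1), (0, 0)))  # "same" padding
out = np.zeros((H, W, C_out))
for i in range(H):
    for j in range(W):
        patch = padded[i:i+K, j:j+K, :]  # one 3x3x3 patch of the input
        # Each filter dots its 27 weights against the patch -> one number per filter
        out[i, j] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))

print(out.shape)  # (8, 8, 64): 3 input channels became 64 output channels
```

Each filter collapses its 3x3x3 patch to a single number, and doing that with 64 filters is exactly where the 64x224x224 output comes from.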
So you’ll notice we entered with 224x224x3 and have output 64x224x224. Another way to think about this is that we’ve taken a colored image and turned it into a stack of black-and-white images, as shown here: https://cs231n.github.io/assets/cnn/convnet.jpeg
I understand that this is in the nature of Conv networks, but it doesn’t have to be. Instead of having one number output by each filter for each convolution, it could be three, one for each channel. I understand that this would leave us with more overall parameters, but perhaps we could decrease complexity elsewhere to make up for it?
Or is the thought process that the CNN will learn whatever weights are appropriate for this decolorization? It just seems unnecessary, and maintaining the color channels would give greater depth of understanding.
When 224x224x3 gets turned into 224x224x64 the number of channels actually increases (from 3 to 64). It’s not throwing away any information but creating new information.
If a convolution filter only wanted to keep the red channel, it could do so. Or only the green channel, or only the blue channel. Or any combination of these channels. There is nothing in the math stopping this from happening.
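As a sketch of this point: a filter whose only nonzero weight is the centre tap on channel 0 passes the red channel through untouched (NumPy only, small hypothetical image sizes):

```python
import numpy as np

H, W = 6, 6
image = np.random.rand(H, W, 3)  # stand-in RGB image

# A 3x3x3 filter that is all zeros except the centre tap on channel 0 (red):
keep_red = np.zeros((3, 3, 3))
keep_red[1, 1, 0] = 1.0

padded = np.pad(image, ((1, 1), (1, 1), (0, 0)))  # "same" padding
out = np.zeros((H, W))
for i in range(H):
    for j in range(W):
        out[i, j] = np.sum(padded[i:i+3, j:j+3, :] * keep_red)

# The output is exactly the red channel of the input:
print(np.allclose(out, image[:, :, 0]))  # True
```

Nothing special about red here: putting the 1.0 on channel 1 or 2 keeps green or blue instead, and any mix of weights across the channel axis keeps any linear combination of them.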
If it makes sense for the network to learn convolution filters that keep the color information, then it will learn this.
But I feel as though this could just as easily be 64x224x224x3 instead of dropping the 3. Can we definitively say that would not be helpful? I may be completely off the mark here, but it seems like we’re losing information by collapsing the 3 channels down to 1 for each filter.
It is true that the 3 channels get “squashed down” into a single channel, and this is repeated 64 times using different filters. So first there are 3 channels, then a convolution is applied, and then there are 64 channels.
But since 64 > 3, the output (64 channels) has more capacity to represent stuff than the input (3 channels). If we were to go from, say, 128 channels down to 64, then we would lose information, because 64 channels can only hold half the data. But here we go from fewer channels to more, and the only information we lose is information the neural net does not consider important.
If the neural net wanted to keep the exact input image for some reason, it could learn convolution filters that kept channels 0, 1, and 2 the same as the input image (i.e. RGB) and do other stuff with channels 3 through 63. In this case, if you plot the first 3 of those 64 channels you’d see the original image again. (This should show you that there is no loss of information here.)
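You can check this directly: build 64 filters where the first 3 are identity taps on channels 0, 1, and 2, and the first 3 output channels reproduce the input image exactly (a NumPy sketch with small hypothetical sizes):

```python
import numpy as np

H, W, C_out = 6, 6, 64
image = np.random.rand(H, W, 3)  # stand-in RGB image

# 64 random filters, but overwrite the first 3 with identity taps
# that copy channels 0, 1, and 2 straight through:
filters = np.random.rand(C_out, 3, 3, 3)
for c in range(3):
    filters[c] = 0.0
    filters[c, 1, 1, c] = 1.0  # centre tap on channel c only

padded = np.pad(image, ((1, 1), (1, 1), (0, 0)))  # "same" padding
out = np.zeros((H, W, C_out))
for i in range(H):
    for j in range(W):
        out[i, j] = np.tensordot(filters, padded[i:i+3, j:j+3, :],
                                 axes=([1, 2, 3], [0, 1, 2]))

# The first 3 of the 64 output channels are the original RGB image:
print(np.allclose(out[:, :, :3], image))  # True
```

So a 3-to-64 convolution can always represent "keep the input and compute 61 extra things", which is why nothing is necessarily lost.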
By the way, there is such a thing as a “depthwise” convolution that does not squash down the input channels: if there are 3 input channels, there are also 3 output channels. You can also supply a “channel multiplier”, so if you set this multiplier to 64, then each input channel has 64 filters applied to it independently of the other input channels, and you end up with your 224x224x64x3 image (although in Keras this would show up as 224x224x192).
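A NumPy sketch of that depthwise convolution with a channel multiplier (again with small hypothetical sizes in place of 224x224; in Keras the equivalent would be a `DepthwiseConv2D` layer with `depth_multiplier=64`):

```python
import numpy as np

H, W, C_in, mult, K = 8, 8, 3, 64, 3
image = np.random.rand(H, W, C_in)

# `mult` separate 2-D filters per input channel; channels never mix:
filters = np.random.rand(C_in, mult, K, K)

padded = np.pad(image, ((1, 1), (1, 1), (0, 0)))  # "same" padding
out = np.zeros((H, W, C_in, mult))
for c in range(C_in):          # each input channel handled independently
    for m in range(mult):      # 64 filters per channel
        for i in range(H):
            for j in range(W):
                out[i, j, c, m] = np.sum(padded[i:i+K, j:j+K, c] * filters[c, m])

print(out.shape)                    # (8, 8, 3, 64): the 3 is preserved
print(out.reshape(H, W, -1).shape)  # (8, 8, 192): how Keras reports it
```

This is exactly the "3 stays a separate axis" behaviour asked about above; the 3x64=192 flattening is just how the framework stores it.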
I feel like the key thing you’re missing is that the 3 dimensions of RGB are just a mapping of colour. As you step through the CNN you’re moving from a colour/2D image space to a higher-dimensional concept space that starts out simple, mapping things like edges, and moves on to patterns and combinations of patterns as the CNN gets deeper. You don’t lose any information in that mapping because you’re going from a low-dimensional space to a much higher-dimensional one.