I’m a bit confused by the spreadsheet. If the input layer is a 3D tensor (3x28x28), how are we representing it on a spreadsheet with only 2 dimensions (28x28)? Are the numbers we see for the input layer maybe the average of the 3 channel (RGB) values?
Sorry, I’m not particularly good with math. I feel like I’m missing something simple.
Thanks for your help!
Edit: So I guess there are no RGB channels in the spreadsheet because the image is greyscale? In any case, how do we convolve over all 3 channels? Does the kernel get convolved with each channel, and then the average or sum is taken?
You’re right, in the spreadsheet example there is only one channel. When there are three input channels, each output channel (filter) has its own 2D kernel for every input channel, so a layer with 3 input channels and N output channels holds 3 × N kernels. Each kernel is convolved with its own input channel, and the results are summed across channels (with any bias added) to give one output value. For a detailed description, along with a nice visualization showing this process, make sure to check out the cs231n page.
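In case a concrete version helps, here is a minimal pure-Python sketch of that sum-over-channels step for a single output filter. The function name `conv2d_multichannel` and the tiny 3x3 "image" are made up for illustration, and it assumes stride 1 and no padding; it is not how any particular library implements it.

```python
def conv2d_multichannel(image, kernels, bias=0.0):
    """Convolve a C x H x W image with one filter made of C separate k x k kernels.

    Assumes stride 1 and no padding. Returns a single 2D feature map.
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    k = len(kernels[0])
    out = []
    for i in range(height - k + 1):
        row = []
        for j in range(width - k + 1):
            total = bias
            # Each channel is multiplied by its OWN kernel, then everything is summed.
            for c in range(channels):
                for di in range(k):
                    for dj in range(k):
                        total += image[c][i + di][j + dj] * kernels[c][di][dj]
            row.append(total)
        out.append(row)
    return out


# Hypothetical 3-channel, 3x3 input (values chosen only to keep the arithmetic easy).
red = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
green = [[2, 2, 2], [2, 2, 2], [2, 2, 2]]
blue = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
image = [red, green, blue]

# One 2x2 kernel per input channel; together they form a single filter.
kernels = [
    [[1, 0], [0, 1]],          # applied to red
    [[0.5, 0.5], [0.5, 0.5]],  # applied to green
    [[1, 1], [1, 1]],          # applied to blue
]

feature_map = conv2d_multichannel(image, kernels, bias=1.0)
print(feature_map)  # [[7.0, 5.0], [5.0, 7.0]]
```

Note that three input channels go in and one 2D feature map comes out. To get N output channels you would repeat this with N different sets of three kernels.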
Ahh, Thank you z0k! That’s exactly what I needed to know.
I’ve definitely been poring over that cs231n page a ton, as well as other resources. They all seem to split the convolution by slice though, and I guess I was misunderstanding what was happening along the depth dimension. I get it now tho, appreciate it!
Can I ask another question?
So I understand the mathematical process: take the elementwise product of the kernel and a given channel, sum it, then add the sums from the other channels plus the bias. What I am not sure about is how this maintains the individual values of RGB. Wouldn’t summing lose track of which colors held which values? How then does the computer learn to represent each color or combination of colors?
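Each channel has its own kernel, so the sum is a weighted one: the output depends on each color through that color’s own weights, and the network can learn, say, large weights for red and near-zero weights for blue. Gradients are computed for the individual weights via backpropagation, so each channel’s kernel is updated separately. It’s probably worthwhile to work through a simple example of backprop to get a sense for how it works.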
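As a rough illustration of the point about per-weight gradients, here is a toy sketch with a single RGB pixel and a 1x1 kernel, i.e. one weight per channel. The squared-error loss, the learning rate of 0.1, and the function names are all assumptions picked to keep the example small; a real network has many more weights and layers.

```python
def forward(pixel, weights, bias):
    """Weighted sum over channels: the same sum a 1x1 convolution computes."""
    return sum(w * x for w, x in zip(weights, pixel)) + bias


def gradients(pixel, weights, bias, target):
    """Gradients of the squared error (y - target)**2 w.r.t. each weight and the bias."""
    error = forward(pixel, weights, bias) - target
    # d(loss)/d(w_c) = 2 * error * x_c, so each channel's weight gets a gradient
    # scaled by that channel's own input value.
    grad_w = [2 * error * x for x in pixel]
    grad_b = 2 * error
    return grad_w, grad_b


pixel = [1.0, 0.0, 0.5]    # hypothetical R, G, B values
weights = [0.0, 0.0, 0.0]  # one weight per channel
bias = 0.0
target = 1.0

grad_w, grad_b = gradients(pixel, weights, bias, target)

# One gradient-descent step with an assumed learning rate of 0.1.
learning_rate = 0.1
new_weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
print(new_weights)  # [0.2, 0.0, 0.1]
```

Even though the forward pass adds the channels together, the red weight moves the most, the blue weight moves half as much, and the green weight does not move at all, because each gradient is scaled by its own channel’s input.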
If you are interested in how it works specifically within the context of CNNs, here is a nice blog post that might be helpful. Although once you see how the process works in a simple example, I’m not sure there’s that much additional insight to be gained for your question by looking at a convolutional network example.