Earlier this year, I took Andrew Ng’s Machine Learning course, which is an excellent introduction to machine learning.

The general architecture Ng describes for a NN is essentially one in which all layers are “fully-connected.” Calculating the activation for a node in a layer involves multiplying that node’s weights by the activations of all nodes in the previous layer, summing the products, and applying the logistic (sigmoid) function to that sum.
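For anyone following along, here is a minimal NumPy sketch of that fully-connected computation. The names `sigmoid` and `dense_layer` and the toy shapes are just for illustration, not taken from Ng’s course materials.

```python
import numpy as np

def sigmoid(z):
    # Logistic function, applied element-wise.
    return 1.0 / (1.0 + np.exp(-z))

def dense_layer(x, W, b):
    # x: (n_in,) activations of the previous layer
    # W: (n_out, n_in) one row of weights per node in this layer
    # b: (n_out,) one bias per node
    # Each output node is sigmoid(weighted sum of ALL inputs + bias).
    return sigmoid(W @ x + b)

x = np.array([1.0, 2.0, 3.0])
W = np.zeros((2, 3))  # all-zero weights, so every weighted sum is 0
b = np.zeros(2)
out = dense_layer(x, W, b)
print(out)  # sigmoid(0) = 0.5 for each of the 2 nodes
```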

So, my question is, was that a substantial oversimplification of the way NNs work? It seems like in this class there are two major differences from Ng’s model:

Not all layers are fully connected. In fact, most of the layers in a CNN are not fully connected, and nodes in most layers (the “convolutional layers,” for example) are only connected to some set of spatially related nodes.

The non-linear function (e.g., Relu) is split out into its own layer, separate from the linear function. So the layers look something like Conv -> Relu -> Conv -> Relu -> Pool -> Dense. In Ng’s model, the linear and non-linear steps are sort of combined into one, since you apply the logistic (sigmoid) function to the linear combination and call that result the new layer.
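A small sketch of why the two views should be equivalent, assuming a plain linear step followed by Relu. The function names (`linear`, `relu`, `fused_layer`) are hypothetical; this only shows that “one combined layer” and “two separate layers” compute the same numbers.

```python
import numpy as np

def linear(x, W, b):
    # The linear part on its own "layer".
    return W @ x + b

def relu(z):
    # The non-linearity on its own "layer".
    return np.maximum(z, 0.0)

def fused_layer(x, W, b):
    # Linear step and non-linearity written as a single layer.
    return np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)

two_step = relu(linear(x, W, b))
one_step = fused_layer(x, W, b)
print(np.allclose(two_step, one_step))
```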

Do I understand that correctly?

Related, if I recall correctly, Jeremy mentioned in class that the “activation function” refers to the non-linear function applied to the conv kernel. Just so I understand the nomenclature correctly: the conv kernel is just called a kernel, but then you apply an activation function to it. Is that right? Maxpool and Relu would be examples of activation functions, correct?

I think how you slice and dice a neural network is not that important - it’s all just semantics.

Activations are what you get when you apply the nonlinearity to what precedes it. The nonlinear functions include (among others) tanh, sigmoid, relu and so on.

Conv layers - layers that have those little sliding windows moving across whatever precedes them. Those little windows are called kernels, but I believe they can go by other names too. In effect, this is the trainable part. A ‘window’ could be a 3 x 3 matrix of weights that you apply to each 3 x 3 section of the previous layer as you move it across.
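Here is a rough NumPy sketch of the sliding-window idea for a single-channel input, assuming stride 1 and no padding. The helper name `conv2d_single` and the averaging kernel are made up for the example; real frameworks do this far more efficiently.

```python
import numpy as np

def conv2d_single(image, kernel):
    # Slide a 2-D kernel over a 2-D image (stride 1, no padding).
    # At each position, multiply the window by the kernel element-wise and sum.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0  # a simple averaging window
result = conv2d_single(image, kernel)
print(result.shape)  # (3, 3): a 5x5 input shrinks without padding
```

In a trained network the nine kernel values would be learned weights rather than a fixed average.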

To be honest, I wouldn’t worry too much about the naming - I think it will all click into place as we go, and I don’t think we got around to discussing the layers in detail just yet. From the course outline that @jeremy presented, I think this is going to happen down the road when we revisit CNNs.

There is probably some really great material linked from the wikis for each lesson, so I would suggest you take a look at those first. The reading that comes to mind that could maybe clear up some of the confusion is the CS231n notes, but I am quite sure the wikis contain material that is even better.

One additional question on the Excel sheet you discussed in our Week 3 class (conv-example.xlsx).

You are applying two convolutional kernels to the input layer. Does this mean the first hidden layer has (almost) double the number of nodes as the input layer? Is that an artifact of your choice not to represent the input layer with 3 RGB channels, since that would be infeasible in Excel? If you had, would the first hidden layer actually have fewer nodes than the input layer?

To frame my question another way, although the input layer in an image processing network might have 3 color channels, the first hidden layer would typically drop explicit color channels, correct?

To get a little further into the details, in the fastai/pytorch framework, if you have let’s say a 224 x 224 x 3 photo, is that represented as a 3-dimensional numpy array, or is it unrolled into a 150528 x 1 array?

Yes, you are right! But we don’t unroll the first tensor. A tensor is just a multidimensional array - so the input consists of 3 channels, each consisting of 224 x 224 values.
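To illustrate the shapes involved, here is a quick NumPy sketch. The variable names are illustrative, and the channels-first ordering reflects the layout PyTorch conv layers expect for a single image (before a batch dimension is added).

```python
import numpy as np

height, width, channels = 224, 224, 3

# A photo as typically loaded from disk: height x width x channels.
image_hwc = np.zeros((height, width, channels), dtype=np.float32)

# Channels-first view: 3 channels, each 224 x 224.
image_chw = image_hwc.transpose(2, 0, 1)

# What "unrolling" would look like, for comparison only.
unrolled = image_chw.reshape(-1, 1)

print(image_hwc.shape)  # (224, 224, 3)
print(image_chw.shape)  # (3, 224, 224)
print(unrolled.shape)   # (150528, 1)
```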

When we slide the window, the window actually utilizes information from all the channels in the previous layer! So our little 3x3 window is actually a 3x3x3 tensor! It combines the information from all three channels to produce a piece of the following layer. One often used set of settings for the convolution operation (stride 1 with one pixel of zero padding on each side) is such that a single 3x3x3 window (a 3 dimensional window!) will produce a 1x224x224 output. Now… we want to be able to detect all sorts of edges / gradients / shapes, so we need a lot of such little windows, each specializing in detecting something else (in the past they were referred to as detectors - for instance, edge detectors - and they were handcrafted!). So, we get an arbitrary number of such sliding windows AKA kernels.
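A rough sketch of a single 3x3x3 window, assuming stride 1 and one pixel of zero padding so the spatial size is preserved. `conv_one_kernel` is a hypothetical helper written with plain loops for readability, not how a framework implements it.

```python
import numpy as np

def conv_one_kernel(x, kernel, pad=1):
    # x: (channels, height, width); kernel: (channels, kh, kw).
    # The kernel spans ALL input channels, so each output value is one
    # sum over a channels x kh x kw block of the (zero-padded) input.
    c, h, w = x.shape
    kc, kh, kw = kernel.shape
    assert kc == c, "kernel depth must match the number of input channels"
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding by default
    out_h = h + 2 * pad - kh + 1
    out_w = w + 2 * pad - kw + 1
    out = np.zeros((1, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[0, i, j] = np.sum(xp[:, i:i + kh, j:j + kw] * kernel)
    return out

x = np.ones((3, 224, 224))   # a dummy 3-channel image of all ones
kernel = np.ones((3, 3, 3))  # a dummy 3x3x3 kernel of all ones
feature_map = conv_one_kernel(x, kernel)
print(feature_map.shape)  # (1, 224, 224)
```

With all-ones inputs, an interior output value is 27 (3 channels times 9 positions), which makes it easy to see that all three channels contribute to every output pixel.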

Say we use 32 kernels - we then produce a next layer of dimensions 32x224x224. We can do various things to it, but we could run the convolutions again on it! Say we now want another set of 32 sliding windows. You could say “I want my first sliding window to only look at channels 1 through 4,” for example, but this is not usually done. The little sliding windows are now going to be… yes! 32x3x3. And rinse and repeat to combine lower level features (edges / dots / etc) into higher level shapes (faces / something that looks like writing / and so on).
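The stacking above can be sketched as follows. This is only a shape check with random weights; `conv_layer` is a hypothetical helper, and I am assuming a 16x16 image instead of 224x224 purely to keep the loops fast.

```python
import numpy as np

def conv_layer(x, weights, pad=1):
    # x: (in_channels, height, width)
    # weights: (n_kernels, in_channels, kh, kw), one full-depth kernel per output channel
    n_kernels, c, kh, kw = weights.shape
    assert x.shape[0] == c, "kernel depth must match input channels"
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out_h = x.shape[1] + 2 * pad - kh + 1
    out_w = x.shape[2] + 2 * pad - kw + 1
    out = np.zeros((n_kernels, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = xp[:, i:i + kh, j:j + kw]
            # Sum over (channels, kh, kw) for every kernel at once.
            out[:, i, j] = np.tensordot(weights, patch, axes=3)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(3, 16, 16))   # 3 colour channels
w1 = rng.normal(size=(32, 3, 3, 3))    # 32 kernels, each 3x3x3
w2 = rng.normal(size=(32, 32, 3, 3))   # 32 kernels, each 32x3x3

h1 = conv_layer(image, w1)
h2 = conv_layer(h1, w2)
print(h1.shape, h2.shape)  # (32, 16, 16) (32, 16, 16)
```

Note that under these settings the first hidden layer has 32 channels where the input had 3, so it holds more values than the input rather than fewer.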

That is nearly all perfectly correct! The only very minor correction is that maxpool isn’t an activation function, since it’s not a 1-1 mapping (there are fewer activations coming out of a pooling layer than there are going in). But that’s a fairly technical detail.
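A tiny sketch of that distinction, assuming a 2x2 max pool with stride 2 on a single 4x4 channel. The helpers `relu` and `maxpool2x2` are written here just for illustration.

```python
import numpy as np

def relu(x):
    # Element-wise: one output for every input.
    return np.maximum(x, 0.0)

def maxpool2x2(x):
    # Non-overlapping 2x2 max pooling: one output per 2x2 block.
    h, w = x.shape
    assert h % 2 == 0 and w % 2 == 0
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[ 1., -2.,  3.,  0.],
              [-1.,  5., -3.,  2.],
              [ 0., -4.,  6., -6.],
              [ 7.,  1., -1.,  4.]])

activated = relu(x)
pooled = maxpool2x2(x)
print(activated.size)  # 16 values in, 16 values out
print(pooled.size)     # 16 values in, 4 values out
print(pooled)          # [[5. 3.] [7. 6.]]
```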

We’ll be looking at fully connected networks in the next class.