I’m sure this is really basic and just some terrible gap in my maths learning. But every time I hear something described as linear or non-linear I get a little nervous. For whatever reason, I just don’t have a feel for why something would be one or the other, or what that implies.

I’ve looked at some of the basic linear topics on Khan Academy, and for basic examples it seems to really mean that it graphs as a straight line. But when we move to higher dimensions I start to lose my intuition.

And then there are situations where linear vs. non-linear didn’t even enter my thinking. For example, while talking about the activation layers and ReLU/softmax/dropout (I can’t remember exactly which lesson), he mentions that it adds a non-linear layer. I assume that means that the convolution or pooling layers are linear and the activation layers aren’t. But what does that mean? What subtle (or maybe not so subtle) context or implication is that supposed to set for me?

Is there some basic reading or exercises I should pursue to improve my instincts in this area? Or an explanation that cleared it up for any of y’all? Thanks for any help.

Without a non-linear layer (and ignoring bias), dense layers are really just matrix multiplication by the learned weights (W).

This means that if you stack multiple dense layers with linear activations and no bias, you could represent all of the layers as a single dense layer (W = W1W2W3…).
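The collapse of stacked linear layers into one can be checked directly with numpy; this is just associativity of matrix multiplication, sketched here with arbitrary example shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three "dense layers" with linear activations and no bias:
# each one is just multiplication by its weight matrix.
W1 = rng.standard_normal((4, 8))
W2 = rng.standard_normal((8, 8))
W3 = rng.standard_normal((8, 3))

x = rng.standard_normal((1, 4))  # a single input row vector

# Passing x through the three layers one at a time...
stacked = ((x @ W1) @ W2) @ W3

# ...is identical to one dense layer whose weights are the product.
W = W1 @ W2 @ W3
single = x @ W

print(np.allclose(stacked, single))  # → True
```

No amount of extra linear layers adds expressive power, which is exactly why the non-linear activation between them matters.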

Since all of those layers can be represented as a single linear transformation, you will never be able to separate classes that can’t be separated by a line (or a hyperplane, in higher dimensions) — e.g. a checkerboard pattern.
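XOR is the smallest example of this failure. A small numpy sketch: a randomized search over lines finds no linear separator for the four XOR points, but adding a single non-linear feature (the product x1*x2, chosen here for illustration) makes a linear threshold suffice:

```python
import numpy as np

# The four XOR points and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# No line w·x + b = 0 can put (0,1) and (1,0) on one side and
# (0,0) and (1,1) on the other; a randomized search finds none.
rng = np.random.default_rng(0)
found_linear = False
for _ in range(100_000):
    w = rng.standard_normal(2)
    b = rng.standard_normal()
    pred = (X @ w + b > 0).astype(int)
    if np.array_equal(pred, y):
        found_linear = True
        break
print(found_linear)  # → False

# With one non-linear feature, x1*x2, a line does the job:
# x1 + x2 - 2*x1*x2 equals XOR exactly on these four inputs.
phi = np.column_stack([X, X[:, 0] * X[:, 1]])
w_nl = np.array([1.0, 1.0, -2.0])
pred_nl = (phi @ w_nl > 0.5).astype(int)
print(np.array_equal(pred_nl, y))  # → True
```

Hidden layers with non-linear activations effectively learn feature maps like this one, instead of requiring you to hand-pick them.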

Try playing around with http://playground.tensorflow.org to get a sense of what adding non-linearities lets you do with the data. (In theory, a network with non-linear activations can approximate any arbitrary function.)

In case you haven’t seen it, there’s a really nice series of videos on YouTube by 3Blue1Brown. I think the one about linear transformations does a great job of connecting matrix multiplication to linear transformations through visualization. Though it’s only in two dimensions, I think the intuition about the structure imposed by linearity extends to higher dimensions rather naturally.

Another resource for a high-level overview is The Master Algorithm by Pedro Domingos, which traces the history of the perceptron and neural networks in chapter 4. The author discusses how the perceptron was criticized by Marvin Minsky for its inability to model a simple logical XOR function. It was known that a network of perceptrons would work, but this led to the credit assignment problem. Later there is a discussion of the sigmoid function, which touches on some of the limitations of linearity in the model.