# Lesson 4 - Neural Networks Question

Hi! I’ve reached the end of Lesson 4 (it was a tough one, no doubt).
But when I got to the last part, where we added non-linearity to our model, I felt a bit overwhelmed by all the unknown concepts, and one of them (naturally the most important one) really got me: the concept of neural networks.
I started to think about what the whole idea looks like, but I ended up with only one explanation for why we use multiple layers: this way we can optimize our parameters more and more. That doesn’t seem very sensible, though, and I feel like I’m missing something. That’s where ReLU comes into the picture, but I can’t see how an activation function separates the layers so that they can each do their own work. So I came up with another theory: we change every negative number to zero, and then when we get to the second layer, thanks to the biases, we can optimize our parameters more and more. But what happens when a parameter like that becomes positive (so that multiplying it with the pixel, in the case of MNIST from Lesson 4, will be positive)? SGD will start to optimize this parameter too, because it’s not zero anymore, and then… why do we need a second layer?
I’m sorry for the long description. Can someone tell me what the mistakes in my theory are? I’ve read the book many times and watched these 10 minutes over and over again, but I still don’t understand.
Cheers!

[slightly edited to add missing “but” and footnote]

Hi @d0rs4n! Good question. It turns out these are mostly separate.

We need nonlinearity because a combination of linear functions is just another linear function, and linear functions aren’t flexible enough.* ReLU is about the dumbest nonlinear function you can get, but (a) it’s fast, (b) it does at least as well as the old sigmoids, and (c) it’s actually a lot better, probably due to a combination of sparsity and constant gradients.
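You can see the “linear stacks collapse” point directly with a few lines of NumPy. This is a toy sketch (the layer sizes are made up, not from the lesson): two linear layers with no activation between them are exactly equivalent to one linear layer, and inserting a ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation in between (illustrative sizes):
W1 = rng.normal(size=(10, 5))
b1 = rng.normal(size=5)
W2 = rng.normal(size=(5, 3))
b2 = rng.normal(size=3)

x = rng.normal(size=10)

# Stacked linear layers...
h = x @ W1 + b1
y_stacked = h @ W2 + b2

# ...collapse into a single linear layer with combined weights:
W = W1 @ W2
b = b1 @ W2 + b2
y_single = x @ W + b

print(np.allclose(y_stacked, y_single))  # True: depth bought us nothing

# Insert a ReLU and the collapse no longer works:
h_relu = np.maximum(h, 0)
y_relu = h_relu @ W2 + b2
print(np.allclose(y_relu, y_single))  # almost surely False now
```

So without the nonlinearity, SGD could only ever find functions a single linear layer could represent anyway.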

We need layers to make things tractable. A single-layer network has all the theoretical power of a deep net, but it needs to be exponentially wide to get it. Making the network deep allows the layers to abstract. Biological visual systems use the same trick.
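To answer the “why a second layer?” part of the question concretely, here is a toy MNIST-shaped forward pass (sizes are mine, purely for illustration). After the ReLU, the first layer’s outputs are 50 non-negative “feature detector” activations; the second layer’s job is to mix those detectors into class scores, which no single linear map of raw pixels can reproduce once the ReLU has discarded the negative parts.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(z, 0)

# Toy forward pass: 28*28 pixels -> 50 hidden units -> 10 class scores.
x = rng.normal(size=784)                         # a fake flattened image
W1, b1 = rng.normal(size=(784, 50)) * 0.01, np.zeros(50)
W2, b2 = rng.normal(size=(50, 10)) * 0.01, np.zeros(10)

h = relu(x @ W1 + b1)   # layer 1: 50 simple "feature detectors", all >= 0
y = h @ W2 + b2         # layer 2: weighted combinations of those detectors

print(y.shape)  # (10,)
```

SGD optimizing a positive first-layer parameter doesn’t make the second layer redundant: the second layer is learning *combinations of layer-1 features*, not pixels, and that composition is what the depth buys you.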

Abstraction means more regularity and fewer parameters, so it’s both faster and more robust.

The magic of deep nets really lies in the automatic feature selection, and that’s driven by the deep structure. Give a regression those same 2,048+ extracted features, and it will often do nearly as well.
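A minimal sketch of that “regression on extracted features” pipeline, with random numbers standing in for a real network’s 2,048 penultimate-layer activations (nothing here is from the lesson; it only shows the shape of the idea):

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend a deep net's penultimate layer gave us 2048 features per image.
# Here they're random stand-ins; the point is only the pipeline shape.
n, d = 100, 2048
feats = rng.normal(size=(n, d))
labels = rng.integers(0, 10, size=n)

# One-hot targets, then a plain least-squares "regression head":
Y = np.eye(10)[labels]
W, *_ = np.linalg.lstsq(feats, Y, rcond=None)
preds = (feats @ W).argmax(axis=1)

# 1.0 here: with d >> n the fit is exact, which is exactly why real
# pipelines judge this on held-out data, not the training set.
print((preds == labels).mean())
```

With real extracted features and a held-out set, a simple head like this often recovers most of the deep net’s accuracy, which is the point: the hard work was the feature extraction.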

* Not flexible enough: SVMs are linear, but they nonlinearly transform the data into a higher-dimensional space.
