Hi! I’ve reached the end of Lesson 4 (it was a tough one, no doubt!).
But when I got to the last part, where we added non-linearity to our model, I felt a bit overwhelmed by all those unknown concepts. One of them (naturally the most important one) really got me: the concept of neural networks.
I started to think about what the whole concept looks like, but I ended up with only one explanation for why we use multiple layers: this way we can optimize our parameters more and more. That doesn’t seem very sensible, though, so I feel like I’m missing something. That’s where ReLU comes into the picture, but I can’t see how an activation function separates the layers so that they can each do their own work. So I came up with another theory: we change every negative number to zero, and then, when we get to the second layer, thanks to the biases we can optimize our parameters more and more. But what happens when such a parameter becomes positive (so that multiplying it by the pixel, in the case of MNIST from Lesson 4, gives a positive result)? SGD will start to optimize this parameter too, because it’s not zero anymore. Then… why do we need a second layer?
I’m sorry for the long description; can someone tell me what the mistakes in my theory are? I’ve read the book many times and watched these 10 minutes over and over again, but I still don’t understand.
Cheers!
[slightly edited to add missing “but” and footnote]
Hi @d0rs4n! Good question. It turns out these are mostly separate.
We need nonlinearity because a combination of linear functions is just another linear function, and linear functions aren’t flexible enough.* ReLU is about the dumbest nonlinearity you can get, but (a) it’s fast, (b) it does at least as well as the old sigmoids, and (c) it’s actually a lot better, probably due to a combination of sparsity and constant gradients.
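A quick sketch of that first point (my own toy example, not code from the lesson): stacking two linear layers with no activation between them collapses into a single linear map, while inserting a ReLU breaks the collapse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two purely linear "layers" (weights only, biases left out for brevity)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((3, 2))
x = rng.standard_normal((5, 4))

# Applying the layers one after the other is the same single linear map
# as multiplying by the combined matrix W1 @ W2 -- no extra power gained.
two_layers = (x @ W1) @ W2
one_layer = x @ (W1 @ W2)
print(np.allclose(two_layers, one_layer))  # True

# Inserting ReLU between the layers makes the model genuinely nonlinear.
relu = lambda z: np.maximum(z, 0)
with_relu = relu(x @ W1) @ W2
print(np.allclose(with_relu, one_layer))  # False
```

So without the activation, the “two-layer” net can never learn anything a one-layer net couldn’t.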
We need layers to make things tractable. A single layer network has all the theoretical power of a deep net, but it needs to be exponentially wide to do so. Making the network deep allows the layers to abstract. Biological visual systems use the same trick.
Abstraction means more regularity and fewer parameters, so it’s both faster and more robust.
The magic of deep nets really lies in the automatic feature selection, and that’s driven by the deep structure. Give a regression those same 2,048+ extracted features, and it will often do nearly as well.
* Not flexible enough: SVMs are linear, but they nonlinearly-transform the data into a higher-dimensional space.
Thank you for your answer! It’s very comprehensive!
But I still have a few questions. When we turn negative numbers into zeros, won’t that make the gradient for this parameter zero? Furthermore, in the second layer this zero will become… the bias(?), and SGD can optimize it more and more. But I’m still not comfortable with the definition of layers and how they interact with each other. Basically, what I see is something like two layers of weights (in the case of a two-layer neural net), both optimized by SGD, but in the end the same thing happens: we multiply our inputs by the weights and add the bias, then do the same with the other layer of weights on the result set. That’s probably where my understanding is lacking, since I can see that between those two operations ReLU will make every “unimportant” weight, well… zero, and here comes my first question again.
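To make my gradient question concrete, here is a tiny sketch (my own toy example with made-up numbers, not code from the book) of one hidden unit with a single weight and bias:

```python
# One hidden unit: pre-activation a = w*x + b, output y = relu(a)
w, b = 0.5, -2.0

def grad_w(x):
    """Gradient of relu(w*x + b) with respect to w, for one input x."""
    a = w * x + b
    # Derivative of ReLU is 1 where a > 0 and 0 where it was clipped;
    # the chain rule then gives dy/dw = (1 or 0) * x.
    return (1.0 if a > 0 else 0.0) * x

print(grad_w(1.0))  # a = -1.5 < 0 -> gradient 0.0: this example can't move w
print(grad_w(6.0))  # a =  1.0 > 0 -> gradient 6.0: another example still can
```

So yes, the gradient really is zero for an example that lands on the negative side, but a different example (or a shifted bias) can still update the same weight.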
I hope you can follow along with my problem, and thank you for your kind help!
My answer didn’t quite work for you, so next take a look at similar discussions from other writers to see if they help:
- ReLU unreasonably effective! I think this one is closest to your question. Apparently the best answer was “Handout 3B.” Good band name.
- Why can’t ReLU [over]fit my sine wave? Has an insightful example/discussion.
- Why ReLU again?
- Are vanishing gradients good? Or… why did this work at all?
- Google says Swish beats ReLU. (2017, so may not have held up?? But interesting.)
Let us know where you are after looking at a couple of those?
I’ve read all of them and also found some videos, but the thing is, I might have a problem understanding something else.
Here, I’m not sure I understand how it’s possible that the first layer constructs 30 features. What are those features? (The book hasn’t mentioned them before…) I mean, in the case of a 28×28 image, you multiply the pixels by the weights and add the bias, and that’s it. In the previous example in Chapter 4, we simply summed this matrix (the matrix after the multiplication) and used a sigmoid function! (Strange, because I can imagine why we have to use sigmoid… because of the loss function.) And that was the prediction. But what does the first layer construct in this case?
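To show where I’m stuck, here is a rough NumPy sketch (my own guess at the shapes, not code from the book) of what I think the two layers compute. Each of the 30 columns of the first weight matrix looks like a separate copy of the single linear model from earlier in Chapter 4, so the “30 features” would just be the 30 activations it produces:

```python
import numpy as np

rng = np.random.default_rng(42)

x = rng.standard_normal(28 * 28)  # one flattened 28x28 image (784 values)

# Layer 1: 30 separate "mini-models", each with its own 784 weights and a bias
W1 = rng.standard_normal((784, 30)) * 0.1
b1 = np.zeros(30)

# Layer 2: combines those 30 activations into one final score
W2 = rng.standard_normal((30, 1)) * 0.1
b2 = np.zeros(1)

h = np.maximum(x @ W1 + b1, 0)  # 30 numbers: how strongly each pattern fired
out = h @ W2 + b2               # the prediction, built from those 30 features
print(h.shape, out.shape)       # (30,) (1,)
```

Is that the right picture, i.e. the first layer is just 30 of the Chapter 4 linear models run at once, and the second layer learns how to weigh their outputs?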
I think that was the problem; I just didn’t realize it until now.