Why do we use ReLUs

devforfu · May 10, 2022, 9:04am

I’ve known about ReLU for a while and know its mathematical definition, but I’m still struggling to understand, or at least build some intuition for, why it cuts off the negative part of its domain and still works as a (universal) approximator. Is it because we combine multiple linear layers interleaved with ReLUs, so negative values still show up there? Like in the pre-activations?

ilovescience · May 10, 2022, 9:13am

I personally found this animation very helpful for visually understanding this:
https://nnfs.io/mvp/

arunslb123 · May 10, 2022, 9:18am

Another great resource for visually understanding that neural nets can compute any function: Neural Networks and Deep Learning

SamFogarty · May 10, 2022, 9:21am

The ReLU can effectively discard some data as noise: if the input value falls below its threshold, the unit outputs zero, so the network can stop that signal from propagating further. The network then learns representations that allow it to discard noise. Importantly, this is in contrast to a function that can only vary ‘smoothly’, such as a linear function, which cannot represent sharp breaks easily.
The main feature of a stack of ReLUs that allows it to represent complexity, then, is this non-linear nature.
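
To make the "sharp breaks" point concrete, here is a minimal sketch (plain Python, not from any library) that builds a piecewise-linear approximation of x² on [0, 1] out of shifted ReLUs. The helper names `fit_relu_sum` and `evaluate`, the target function, and the choice of 8 segments are all assumptions made for illustration; a trained network would find its own kinks by gradient descent rather than by this hand construction.

```python
def relu(x):
    return max(0.0, x)

def fit_relu_sum(g, knots):
    """Return (bias, units) so that bias + sum(c * relu(x - k)) interpolates g
    at the knots. Each unit (c, k) adds a change of slope c at position k."""
    slopes = [(g(b) - g(a)) / (b - a) for a, b in zip(knots, knots[1:])]
    units = []
    previous_slope = 0.0
    for knot, slope in zip(knots, slopes):
        units.append((slope - previous_slope, knot))
        previous_slope = slope
    return g(knots[0]), units

def evaluate(bias, units, x):
    return bias + sum(c * relu(x - k) for c, k in units)

def target(x):
    return x * x

knots = [i / 8 for i in range(9)]            # 8 equal segments on [0, 1]
bias, units = fit_relu_sum(target, knots)

grid = [i / 1000 for i in range(1001)]
max_err = max(abs(evaluate(bias, units, x) - target(x)) for x in grid)
print(f"{len(units)} ReLU units, max error on [0, 1]: {max_err:.5f}")
```

Each ReLU contributes one kink, so adding more units (more knots) makes the broken line hug the curve more closely.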

asharma · May 10, 2022, 9:24am

One way to think about it is that ReLU adds non-linearities to the computation.
All other computations in neural nets are of the form y = wx + b. If we didn’t add a ReLU or another such function, we would be limiting ourselves to approximating linear functions only.
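
A small sketch of that claim, using scalar weights picked arbitrarily for illustration: two stacked y = wx + b layers collapse into a single one, while putting a ReLU between them produces a kink that no single linear layer can reproduce.

```python
def relu(x):
    return max(0.0, x)

def linear(w, b):
    """Return the function x -> w * x + b."""
    return lambda x: w * x + b

# Two "layers" with arbitrary example weights (assumed values).
layer1 = linear(2.0, -1.0)
layer2 = linear(-3.0, 0.5)

def stacked(x):
    return layer2(layer1(x))

# Algebra: w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
collapsed = linear(-3.0 * 2.0, -3.0 * -1.0 + 0.5)

def stacked_relu(x):
    return layer2(relu(layer1(x)))

xs = [-2.0, -0.5, 0.0, 0.5, 1.0, 3.0]
same_without_relu = all(abs(stacked(x) - collapsed(x)) < 1e-12 for x in xs)
print("two linear layers == one linear layer:", same_without_relu)

# With the ReLU in between, the midpoint no longer lies on the straight line
# through the endpoints, so the function is not linear (or affine).
midpoint_on_line = (stacked_relu(0.0) + stacked_relu(1.0)) / 2
print("with ReLU, f(0.5) =", stacked_relu(0.5), "vs straight line:", midpoint_on_line)
```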

Hope that helps. Will share some more intuitive posts if I come across any

stantonius · May 10, 2022, 9:25am

So I (now) understand the output of many ReLUs together can approximate any complex function. What I don’t get is why, for each individual ReLU, do we effectively discard half the input values to the ReLU and set them to 0. It seems “wasteful” at a micro level, but at the macro level it makes perfect(-ish) sense.
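
One way to see that the "discarded" half is not lost at the network level: a pair of ReLU units looking at x and -x keeps everything between them. This is a hand-built illustration with made-up function names, not a claim about what a trained network actually learns.

```python
def relu(x):
    return max(0.0, x)

def identity_from_relus(x):
    """One unit keeps the positive part, the other keeps the negative part."""
    return relu(x) - relu(-x)

def abs_from_relus(x):
    """The same two units, combined with different output weights."""
    return relu(x) + relu(-x)

for x in [-3.5, -1.0, 0.0, 2.0]:
    print(x, identity_from_relus(x), abs_from_relus(x))
```

Each unit on its own zeroes out half of its inputs, but the layer as a whole can still pass the full signal along if the weights call for it.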

Does this level of detail even matter?

SamFogarty · May 10, 2022, 9:32am

The neural network must compress a whole picture into a single class number. The model becomes like a smart zip function, compressing data into smaller representations. It needs to be able to discard data that is less important than other signals. Consider that some pixels in an image may be much more relevant than others. So being able to ‘discard’ data is part of the solution for classification.

stantonius · May 10, 2022, 9:33am

I see this is similar to @devforfu’s question, so never mind. Although I’m still not sure I get it.

mike.moloch · May 10, 2022, 1:24pm

In my limited understanding, ReLU is computationally “cheap”, and it helps control the vanishing gradient problem you might find with other activation functions like sigmoid. Andrew Ng goes into detail about this in one of his lectures, which are rather math-heavy (Coursera Deep Learning Specialization).
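
A rough back-of-the-envelope sketch of the vanishing gradient point. It looks only at the activation derivatives along one path through a 10-layer network and ignores the weights entirely, which is a simplifying assumption; the depth of 10 is also just an example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)       # never larger than 0.25 (reached at x = 0)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

depth = 10

# Best case for sigmoid: every pre-activation sits exactly at 0.
sigmoid_chain = sigmoid_grad(0.0) ** depth

# ReLU on an active path: every pre-activation is positive.
relu_chain = relu_grad(1.0) ** depth

print(f"sigmoid, {depth} layers: {sigmoid_chain:.2e}")
print(f"relu,    {depth} layers: {relu_chain:.2e}")
```

The flip side is that a ReLU unit whose pre-activation is negative passes no gradient at all, which is part of why variants such as leaky ReLU exist.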

suvash · May 10, 2022, 2:25pm

If you’re thinking like a programmer, it clicked for me when I could see that ReLU acts like a conditional If statement.

If positive,
then activation passed through,
else not.

Now, given all the ReLUs in a network, there’s a lot of conditional logic the network can be trained to do. That’s one way to build intuition about it.
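
Written out as actual code, the "if statement" view looks roughly like this. The `fires_above` helper is hypothetical, added only to show how a bias shifts where the condition switches on.

```python
def relu_if(x):
    """ReLU written as an explicit conditional."""
    if x > 0:
        return x       # positive: pass the activation through
    else:
        return 0.0     # otherwise: block it

def relu_max(x):
    """The usual one-line form; same behaviour as relu_if."""
    return max(0.0, x)

def fires_above(threshold, x):
    """Hypothetical helper: a unit that responds only once x exceeds threshold."""
    return relu_max(x - threshold)

samples = [-2.0, -0.1, 0.0, 0.3, 5.0]
outputs = [relu_if(x) for x in samples]
print(outputs)
print(fires_above(1.0, 0.5), fires_above(1.0, 3.0))
```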

suvash · May 10, 2022, 2:45pm

This is a pretty good post. Thanks for sharing !

devforfu · May 10, 2022, 2:45pm

The ReLU discussion got quite some momentum! Thank you all for the comments and cool links! (Sorry that it somewhat distracted from the main lecture’s topic. Maybe we can move it to a separate section if you think it is out of place here.)

I would say that, in general, the concepts of degrees of freedom and universal approximation are what I use when thinking about multi-layer models. That is, the non-linearity between the layers gives you enough flexibility to approximate very complex surfaces. Otherwise, you just have a series of matrix multiplications, which is the same as having a single resulting matrix. (As was highlighted in the lecture.)

But somehow, to me, functions like sigmoid and tanh seem more “intuitive”, i.e., they act like saturation clips that prevent your activations from going wild. And these days, ReLU (and its variants) is the standard approach, except maybe in RNN-like architectures. So obviously, they have better optimization properties, even though they have an unbounded linear part and cut off negative values.

Sounds reasonable! So essentially, we don’t care about these negative values. It reminds me of dropout regularization, where we ignore some connections completely to get rid of the noise.

Yeah, a good point! In that sense, they also remind me of transistors, which open or close depending on the current at their base.

P.S. feels like I should read some papers about activations and their properties…

jeremy · May 10, 2022, 10:25pm

Not necessarily! In fact the best compression algorithms nowadays in the research literature use exactly the approach that @SamFogarty described. It turns out that compression and neural net learning are kinda two sides of the same coin! (This is even more true when you consider lossy compression…)

jeremy · May 10, 2022, 10:32pm

Yup that’s how LZW works - but you can do lossless compression using a neural net too! It doesn’t have to be a table.

Here’s a really interesting project for one aspect of that:

jeremy · May 10, 2022, 10:40pm

Bart, Sam is one of our TAs, and his response to the student’s question was accurate and appropriate. An NN is, indeed, like a “smart zip”. I think the details of how a traditional (“non smart”) zip works could be distracting to this explanation, so I’d rather we didn’t go further into those weeds.

Interogativ · May 10, 2022, 10:41pm

Point taken, my concern was that the new student might get confused. I’ll gladly delete my post if you wish.

jeremy · May 10, 2022, 10:42pm

It’s all good

Interogativ · May 10, 2022, 11:10pm

The Hutter Prize is a very interesting take on large text compression. I wish I’d known about this two years ago (when I would’ve had time to pursue it). Still, this thing looks ripe for NNs and perhaps transformers.

jeremy · May 11, 2022, 1:05am

Good idea - I’ve done that now.