Why do we use ReLUs

Even though I’ve been aware of ReLU for a while and know its mathematical definition, I’m still struggling to understand, or at least build some intuition for, why it cuts off the negative part of the domain and still works as a (universal) approximator. Is it because multiple linear layers are interleaved with ReLUs, so negative values still show up along the way, e.g. in the pre-activations?


I personally found this animation very helpful for visually understanding this:


Another great resource for visually understanding that neural nets can compute any function: Neural Networks and Deep Learning


ReLU can effectively discard some data as noise: if the input value falls below its threshold (zero), the output is zero, so the network can stop that signal from propagating further. The network then learns representations that allow it to discard noise. This is in contrast to a linear function, which can only vary ‘smoothly’ and cannot represent sharp breaks.
The main feature that allows a stack of ReLUs to represent complexity is this non-linear nature.
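
To make the "sharp breaks" idea concrete, here is a minimal sketch (my own illustration, not from the lecture) that builds a piecewise-linear approximation of x² out of a plain sum of ReLUs. The helper name `relu_interpolant` and the choice of knots are assumptions made up for this example; a trained network would learn its own kink positions and weights.

```python
def relu(x):
    return max(0.0, x)

def relu_interpolant(f, knots):
    """Return a sum of ReLUs that matches f at each knot (knots sorted ascending)."""
    # Slope of the straight line joining f at consecutive knots.
    slopes = [(f(b) - f(a)) / (b - a) for a, b in zip(knots, knots[1:])]
    # Each ReLU switches on at a knot and contributes the change in slope there.
    weights = [slopes[0]] + [s1 - s0 for s0, s1 in zip(slopes, slopes[1:])]
    bias = f(knots[0])

    def approx(x):
        return bias + sum(w * relu(x - k) for w, k in zip(weights, knots))

    return approx

square = lambda x: x * x
knots = [i / 4 for i in range(-8, 9)]  # hypothetical kink positions: -2.0, -1.75, ..., 2.0
approx = relu_interpolant(square, knots)

# Largest gap between the ReLU sum and x^2 on a fine grid over [-2, 2].
max_error = max(abs(approx(i / 100) - square(i / 100)) for i in range(-200, 201))
print(max_error)
```

Every ReLU adds one kink, so more units means more kinks and a tighter fit. Outside the range of the knots the sketch makes no attempt to follow the target.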


One way to think about it is that ReLU adds non-linearity to the computation.
All other computations in neural nets are of the form y = wx + b. If we didn’t add a ReLU or another such function, we would be limiting ourselves to approximating linear functions only.
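
A tiny sketch of that point, using scalar layers with made-up weights (all numbers here are hypothetical): two stacked y = wx + b layers collapse into a single one, while putting a ReLU in between breaks the collapse.

```python
def linear(w, b):
    return lambda x: w * x + b

def relu(x):
    return max(0.0, x)

layer1 = linear(2.0, -1.0)   # hypothetical weights
layer2 = linear(-3.0, 0.5)

# Two linear layers back to back...
stacked = lambda x: layer2(layer1(x))
# ...are the same as one linear layer with w = w2*w1 and b = w2*b1 + b2.
collapsed = linear(-3.0 * 2.0, -3.0 * -1.0 + 0.5)

# With a ReLU in between, the result has a kink and is no longer of the form wx + b.
with_relu = lambda x: layer2(relu(layer1(x)))

xs = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]
print([abs(stacked(x) - collapsed(x)) < 1e-12 for x in xs])
print([with_relu(x) for x in xs])
```

The same algebra holds for matrices, which is why stacking linear layers without an activation buys nothing.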

Hope that helps. Will share some more intuitive posts if I come across any


So I (now) understand that the output of many ReLUs together can approximate any complex function. What I don’t get is why, for each individual ReLU, we effectively discard half the input values and set them to 0. It seems “wasteful” at a micro level, but at the macro level it makes perfect(-ish) sense.

Does this level of detail even matter?
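
One way to look at the "wasteful" part, as a minimal sketch: a single ReLU does drop the negative half, but a second unit fed the negated input keeps exactly that half, so a pair of units can carry everything. The function names below are made up for illustration, and nothing guarantees a trained network ends up with pairs like this.

```python
def relu(x):
    return max(0.0, x)

def identity_from_relus(x):
    # One unit keeps the positive part, the other keeps the negative part.
    return relu(x) - relu(-x)

def abs_from_relus(x):
    # Same two units, combined with different output weights.
    return relu(x) + relu(-x)

for x in [-3.0, -0.5, 0.0, 2.0]:
    print(x, identity_from_relus(x), abs_from_relus(x))
```

So the next layer can choose, through its weights, whether to recover the full signal, only its magnitude, or just one side of it.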


The neural network must compress a whole picture into a single class number. The model becomes like a smart zip function, compressing data into smaller representations. It needs to be able to discard data that is less important than other signals. Consider that some pixels in an image may be much more relevant than others. So being able to ‘discard’ data is part of the solution for classification.


I see this is similar to @devforfu’s question, so never mind. Although I’m still not sure I get it :sweat_smile:


In my limited understanding, ReLU is computationally “cheap” and it helps control the vanishing gradient problem you might find with other activation functions like sigmoid. Andrew Ng goes into detail about this in one of his lectures, which are rather math heavy (Coursera deep learning specialization).
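
A rough sketch of the vanishing gradient point, under a big simplifying assumption: it ignores the weights entirely and only multiplies the activation derivatives along a chain of 10 layers, with made-up pre-activation values. The sigmoid derivative never exceeds 0.25, so the product shrinks fast, while ReLU's derivative is exactly 1 for any positive input.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x == 0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def chained_grad(grad_fn, pre_activations):
    # Chain rule through the activations only (weights are ignored in this sketch).
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product

pre_activations = [0.5] * 10  # hypothetical values, all positive
print(chained_grad(sigmoid_grad, pre_activations))
print(chained_grad(relu_grad, pre_activations))
```

The flip side is that ReLU's derivative is 0 for negative inputs, so a unit whose pre-activation is always negative passes no gradient at all.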


If you’re thinking like a programmer, it clicked for me when I could see that ReLU acts like a conditional If statement.

If positive,
then activation passed through,
else not.

Now consider all the ReLUs in a network: there’s a lot of conditional logic the network can be trained to perform. That’s one way to build intuition about it.
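
Spelled out as code, the if statement above looks roughly like this. The `unit` helper and its weights are hypothetical, just to show that the learned w and b decide where the condition flips.

```python
def relu_if(x):
    # ReLU written as an explicit conditional.
    if x > 0:
        return x
    else:
        return 0.0

def relu_max(x):
    # The usual one-liner; same behaviour as relu_if.
    return max(0.0, x)

def unit(x, w, b):
    # A single neuron: the branch taken depends on the sign of w*x + b.
    z = w * x + b
    return relu_if(z)

# With w = 2 and b = -4, the unit only "passes through" once x exceeds 2.
print(unit(1.0, 2.0, -4.0))
print(unit(3.0, 2.0, -4.0))
```

Training moves w and b around, which amounts to moving the point where each of these conditions switches.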


This is a pretty good post. Thanks for sharing! :raised_hands:


The ReLU discussion got quite some momentum! Thank you all for the comments and cool links! (Sorry that it somewhat distracted from the main lecture’s topic. Maybe we can move it to a separate section if you think it is out of place here.)

I would say that, in general, the concepts of degrees of freedom and universal approximation are what I use when thinking about multi-layer models. The non-linearity between the layers gives you enough flexibility to approximate very complex surfaces. Otherwise, you just have a series of matrix multiplications, which is the same as having one resulting matrix. (As was highlighted in the lecture.)

But somehow, to me, functions like sigmoid and tanh seem more “intuitive”, i.e., like saturation clips that prevent your activations from going wild. And these days, ReLU (and its variants) is the standard approach, except maybe in RNN-like architectures. So evidently they have better optimization properties, even though they have an unbounded linear part and cut off negative values.

Sounds reasonable! So essentially, we don’t care about these negative values. It reminds me of dropout regularization, where we ignore some connections completely to get rid of the noise.

Yeah, a good point! In that sense, they also remind me of transistors, which open or close depending on the current at their base :smile:

P.S. feels like I should read some papers about activations and their properties…


Not necessarily! In fact the best compression algorithms nowadays in the research literature use exactly the approach that @SamFogarty described. It turns out that compression and neural net learning are kinda two sides of the same coin! (This is even more true when you consider lossy compression…)


Yup that’s how LZW works - but you can do lossless compression using a neural net too! It doesn’t have to be a table.

Here’s a really interesting project for one aspect of that:

Bart, Sam is one of our TAs, and his response to the student’s question was accurate and appropriate. An NN is, indeed, like a “smart zip”. I think the details of how a traditional (“non smart”) zip works could be distracting to this explanation, so I’d rather we didn’t go further into those weeds.


Point taken, my concern was that the new student might get confused. I’ll gladly delete my post if you wish.

It’s all good :slight_smile:


The Hutter Prize is a very interesting take on large text compression. I wish I’d known about it two years ago (when I would’ve had time to pursue it). Still, this looks ripe for NNs and perhaps transformers.

Good idea - I’ve done that now.
