ReLU and its effectiveness

We all know the advantages and disadvantages of ReLU with respect to other popular nonlinearities like sigmoid, tanh, etc.

What I struggle to understand is its effectiveness in allowing an MLP to approximate nonlinear functions and separate nonlinear regions.

ReLU is, in the end, the most trivial linear function (the identity x, which leaves its input untouched) glued to the constant function 0.
Restricting ourselves to a single neuron, it just leaves the result of a dot product as it is, or suppresses it altogether if it's negative.

How can ReLU be a useful nonlinearity? After all, we know that an NN with only linear activations (even the most general mx + q, with m and q varying per layer or even per neuron) would not be capable of separating nonlinear regions: any composition of linear mappings, no matter how long, is just a linear mapping.
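Just to make that last point concrete, here's a throwaway sketch (random weights, nothing from any library) showing two stacked linear-activation layers collapsing into a single affine map:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two stacked "linear-activation" layers (the general mx + q case):
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_layers(x):
    return W2 @ (W1 @ x + b1) + b2

# They collapse into a single affine layer W x + b,
# since W2 @ (W1 @ x + b1) + b2 == (W2 @ W1) @ x + (W2 @ b1 + b2):
W, b = W2 @ W1, W2 @ b1 + b2
```

So no matter how deep the stack, with linear activations the network can only ever draw linear decision boundaries.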


Maybe this will help you understand it. He uses a step function as the nonlinearity, but ReLU would work equally well.
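Here's the same idea as a quick sketch (my own toy construction, not from that video): two ReLUs make a steep "soft step", two soft steps make a bump, and a weighted sum of bumps approximates any reasonable function on an interval, all with a single hidden ReLU layer.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def soft_step(x, k=1000.0):
    # Two ReLUs glued together: 0 for x <= 0, 1 for x >= 1/k,
    # and a steep linear ramp in between. As k grows this tends to a step.
    return relu(k * x) - relu(k * x - 1.0)

def bump(x, lo, hi, k=1000.0):
    # Difference of two soft steps: roughly 1 on [lo, hi], 0 elsewhere.
    return soft_step(x - lo, k) - soft_step(x - hi, k)

def relu_approx(f, x, a=0.0, b=1.0, n=50):
    # Piecewise-constant approximation of f on [a, b] built only from ReLUs:
    # one weighted bump per bin, i.e. a single hidden layer of 4*n ReLU units.
    edges = np.linspace(a, b, n + 1)
    y = np.zeros_like(x, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        y += f((lo + hi) / 2.0) * bump(x, lo, hi)
    return y
```

More bins (and a steeper k) push the approximation error down as far as you like, which is exactly the universal-approximation intuition.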


You should play with

It will give you a good intuition of how the different activations work.


Useful links. But I would have hoped for something a bit more theoretically grounded…

Here you go:

A good theoretical paper that shows that ReLU neural networks are piecewise linear, and that because of this they are susceptible to adversarial examples.


And by the way, the regions where sigmoid and tanh are most strongly nonlinear (their saturating tails) are precisely the regions associated with vanishing/exploding gradients. So they are nonlinear, but not usefully so everywhere.
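You can check the piecewise-linear claim directly. In this sketch (a random untrained net of my own, for illustration only), the on/off pattern of the ReLUs indexes a linear region; as long as the pattern doesn't change, the network behaves exactly affinely, even though it is not affine globally:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small random one-hidden-layer ReLU network. It's untrained:
# piecewise linearity is a structural property, not a learned one.
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def net(x):
    return (W2 @ np.maximum(0.0, W1 @ x + b1) + b2)[0]

def pattern(x):
    # Which ReLUs are active; this pattern identifies the linear region.
    return tuple((W1 @ x + b1) > 0)
```

Within one region the second difference of the output along any line is (up to float rounding) exactly zero, which is the definition of locally affine behavior.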

Thanks, I’m sure I’ll enjoy it. 🙂

Check lecture number 3 on MLPs, slide #7: you'll find a visualization of how ReLU is able to approximate nonlinearities.

hope it helps.



Didn’t know that course. Thanks, I think it will be interesting for other stuff too…

EDIT: OK, found it in handout 3B. It is as I imagined it, but it provides the due justification.

Thanks! That answers my question!