We all know the advantages and disadvantages of ReLU with respect to other popular nonlinearities like sigmoid, tanh, etc.
What I struggle to understand is its effectiveness in allowing an MLP to approximate nonlinear functions and separate nonlinear regions.
ReLU is, in the end, the most trivial linear function (x, which leaves its input untouched) glued to the constant function 0.
Restricting ourselves to a single neuron, it just leaves the result of a dot product as it is, or suppresses it altogether if it’s negative.
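To make the single-neuron picture concrete, here is a minimal sketch in plain Python. The names `relu` and `neuron` and the weight values are made up for illustration, and the bias term is left out to match the "dot product" description above.

```python
def relu(z):
    """ReLU: identity for positive inputs, zero otherwise."""
    return max(0.0, z)


def neuron(weights, inputs):
    """Single ReLU neuron without bias: ReLU applied to a dot product."""
    pre_activation = sum(w * x for w, x in zip(weights, inputs))
    return relu(pre_activation)


w = [1.0, -2.0]  # hypothetical weights, chosen only for illustration

print(neuron(w, [3.0, 1.0]))  # dot product is 1.0, passed through unchanged
print(neuron(w, [1.0, 3.0]))  # dot product is -5.0, suppressed to 0.0
```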
How can ReLU be a useful nonlinearity? After all, we know that a NN with only linear activations (even the most general affine form mx+q, with m and q varying for each layer or even each neuron) would not be capable of separating nonlinear regions: any composition of affine mappings, no matter how long, is just another affine mapping.
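The collapse of stacked mx+q activations can be checked numerically. The sketch below is restricted to the scalar case for simplicity; `compose_affine`, `apply_layers` and the (m, q) values are hypothetical and only serve to illustrate the argument.

```python
def compose_affine(layers):
    """Collapse a chain of scalar maps x -> m*x + q into a single (m, q) pair."""
    m_total, q_total = 1.0, 0.0  # start from the identity map
    for m, q in layers:
        # Applying x -> m*x + q after x -> m_total*x + q_total
        # gives x -> (m*m_total)*x + (m*q_total + q).
        m_total, q_total = m * m_total, m * q_total + q
    return m_total, q_total


def apply_layers(layers, x):
    """Push x through every layer in order, one affine map at a time."""
    for m, q in layers:
        x = m * x + q
    return x


# Three "layers", each with its own slope and offset (illustrative values).
layers = [(2.0, 1.0), (-3.0, 0.5), (0.5, -4.0)]
m, q = compose_affine(layers)

for x in (-2.0, 0.0, 1.5):
    # The deep stack and the single collapsed map agree on every input.
    print(x, apply_layers(layers, x), m * x + q)
```

However many layers are added to the list, `compose_affine` still returns one slope and one offset, which is the point made in the parenthesis above.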
And, by the way, the regions where sigmoid and tanh are highly nonlinear are associated with exploding or vanishing gradients. So they are nonlinear, but not quite.