So, the second question is easy to answer.
Until ReLU became popular, most neural networks used either tanh or a sigmoid activation. Neither of these is linear, and both involve exponential functions, which means that evaluating them and their gradients is more expensive than evaluating a (piecewise) linear function. So yes, other functions have been in use. Because these functions saturate (their gradients shrink toward zero for large inputs), neural networks could not be very deep. ReLU does not have that problem and works well with SGD, so it was a good fit (it came around 2009, I read? not sure about this).
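To make the cost and saturation points concrete, here is a small sketch (my own illustration, not from any particular framework) comparing the gradients of sigmoid and ReLU. The sigmoid gradient collapses toward zero for large inputs, while the ReLU gradient stays at 1 for any positive input:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of sigmoid: s(x) * (1 - s(x)); needs an exp() evaluation
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # derivative of ReLU: just a comparison, no exponentials
    return (x > 0).astype(float)

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(x))  # tiny at both ends: the saturation problem
print(relu_grad(x))     # exactly 1 for every positive input
```

Running this shows `sigmoid_grad` is on the order of 1e-5 at x = 10, which is why gradients vanish as they are multiplied through many saturated layers.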
There are also a lot of other functions based on ReLU (Leaky ReLU, Parametric ReLU), which are likewise piecewise linear; but there are plenty of other activation functions as well (see Rectifier (neural networks) - Wikipedia).
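The two ReLU variants mentioned above differ only in how they treat negative inputs; a minimal sketch (the 0.01 default slope is a common convention, not a requirement):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # fixed small slope alpha for negative inputs instead of a hard zero
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # Parametric ReLU: same shape, but alpha is a learned parameter
    return np.where(x > 0, x, alpha * x)
```

Both stay piecewise linear, but negative inputs still get a small gradient, which avoids "dead" units that a plain ReLU can produce.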
The answer is that an activation function doesn't need to 'predict' a negative value. The point of the activation function is not to produce your final prediction, but to introduce non-linearity into the middle layers of your neural network. You then use an appropriate function at the last layer to get the output values you want, e.g. softmax for classification, or just a linear output for regression.
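The split between hidden-layer non-linearity and output-layer function can be sketched like this (a toy illustration, assuming a 3-class classification head):

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability before exponentiating
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# hidden layers use ReLU purely for non-linearity...
hidden = np.maximum(0.0, np.array([-0.5, 1.2, 3.0]))

# ...while the last layer picks the function that matches the task:
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)   # classification: a probability distribution
regression_out = logits   # regression: just the raw linear output
```

The hidden ReLU never needs to emit negative values; the output layer is where the range of the prediction is decided.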