At the end of chapter 4, a ReLU function is added between the linear layers. I understand the reason for that (adding non-linearity), but why ReLU? Why change negative numbers to 0? Why not -1, or -10, or -0.1, or 4.5? Why 0?

Also, isn't resetting negative numbers to 0 like messing with the parameters calculated in the previous linear layer? If my model needs negative parameters to work, it will never work… I'm confused. I'll re-read the chapter.

Here’s a first reply - others can improve it I’m sure.

ReLU is h = max(0, a), where a = Wx + b is linear. The two main advantages of ReLU, speed and non-vanishing gradients, still hold for variants like h = max(-10, a) or h = max(25, a). For those purposes I suspect it doesn't matter at all: moving the kink just shifts the weights around, since max(c, a) = c + max(0, a - c) and the constant c can be absorbed into the biases.
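To make the "just shifts the weights around" point concrete, here is a small sketch in plain Python. The names `shifted_relu` and `via_plain_relu` and the constant `c` are mine, not from the chapter. It only checks that clamping at c is the same as an ordinary ReLU with the offset moved into the biases on either side.

```python
def relu(a):
    # Standard ReLU: h = max(0, a)
    return max(0.0, a)

def shifted_relu(a, c):
    # Hypothetical variant from the question: clamp at c instead of 0
    return max(c, a)

def via_plain_relu(a, c):
    # Same function written with a plain ReLU:
    # subtract c before (a bias on the input side), add c after (a bias on the output side)
    return c + relu(a - c)

c = -10.0
samples = [-25.0, -10.0, -3.5, 0.0, 4.5, 25.0]
for a in samples:
    assert shifted_relu(a, c) == via_plain_relu(a, c)
```

So as far as I can tell, a network using max(-10, a) can represent the same functions as one using max(0, a); only the learned biases differ.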

However, choosing 0 means your activations are sparse (many of them are exactly 0), which (a) may help reduce overfitting, and (b) can allow performance gains, if the hardware and algorithms support it. I'd be happy to hear more on this from people who know more.
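A rough way to see the sparsity point, as a sketch only: the pre-activations below are made-up standard normal samples, not values from a real model, and `zero_fraction` is a helper I introduced for the illustration. With a cutoff at 0 about half of the outputs come out exactly 0, while a cutoff at -10 produces no exact zeros.

```python
import random

def relu(a):
    return max(0.0, a)

def zero_fraction(values):
    # Share of entries that are exactly 0
    return sum(1 for v in values if v == 0.0) / len(values)

random.seed(0)
# Assumed pre-activations a = Wx + b, drawn from a standard normal for illustration
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

relu_out = [relu(a) for a in pre_activations]
shifted_out = [max(-10.0, a) for a in pre_activations]  # clamp at -10 instead of 0

relu_zero_fraction = zero_fraction(relu_out)
shifted_zero_fraction = zero_fraction(shifted_out)
print(relu_zero_fraction, shifted_zero_fraction)
```

Whether those zeros translate into speedups depends on the hardware and libraries, as noted above.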

I am stuck around the same place and came here to look for some answers.

I believe this question doesn’t have a straight satisfying answer.

I found this topic, which explains the situation quite well. You have probably already seen it, but you can also take a look at this Wikipedia page. Neither of them says exactly why ReLU does what it does, but you can at least gain some insight.

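The closest standard variant to those other suggestions is generally called “leaky ReLU”: instead of setting negative inputs to 0, it multiplies them by a small slope such as 0.01.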
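For reference, a minimal sketch of leaky ReLU as it is usually defined. The default slope of 0.01 is a common choice, not something from the chapter. Note that it differs from clamping at a constant like -1: negative inputs keep a small, non-zero slope rather than being flattened.

```python
def leaky_relu(a, negative_slope=0.01):
    # Positive inputs pass through; negative inputs are scaled down, not zeroed
    return a if a > 0 else negative_slope * a

def clamp_at(a, c=-1.0):
    # The "why not -1?" idea from the question: flat below c
    return max(c, a)

for a in (-100.0, -2.0, 0.0, 3.0):
    print(a, leaky_relu(a), clamp_at(a))
```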

There are more activation functions out there, and some can perform better. ReLU was by far the most influential one, though, and the most important to understand.

Major departures from previous solutions included the sharp non-linearity at 0 and the unbounded maximum value. Historically, activations like tanh were used, so compare that to ReLU. Tanh had two issues: near 0 it is almost linear, so it behaves like no activation at all, and far from 0 it saturates, squashing all your numbers to roughly 1 or -1, where the gradient is close to 0.
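The two tanh issues can be checked numerically. This is a sketch using only the standard library; `tanh_grad` and `relu_grad` are helper names I made up, and the sample inputs are arbitrary.

```python
import math

def tanh_grad(a):
    # Derivative of tanh: 1 - tanh(a)^2
    return 1.0 - math.tanh(a) ** 2

def relu_grad(a):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if a > 0 else 0.0

# Columns: input, tanh(input), tanh gradient, ReLU gradient
for a in (0.01, 0.1, 1.0, 5.0, 10.0):
    print(a, math.tanh(a), tanh_grad(a), relu_grad(a))
```

Near 0, tanh(a) is almost equal to a, and for large inputs its gradient shrinks toward 0, while the ReLU gradient stays at 1 for any positive input.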