From the book chapters, I know that ReLU takes the maximum of a number and 0. Does that imply that an activation which is less than zero will never make its way from the output of one linear layer to the input of the next?
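To make sure I'm picturing it right, here's a toy version of ReLU as I understand it (my own sketch, not the library implementation):

```python
def relu(x):
    # ReLU: negative activations are clamped to zero;
    # non-negative activations pass through unchanged.
    return max(x, 0.0)

print(relu(2.5))   # 2.5 — passes through
print(relu(-1.3))  # 0.0 — the negative value is discarded
```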
If so, it seems surprising that ReLU would be a safe choice for a non-linear layer. If our activations haven’t yet reached the model’s final layer (where we’d normalize all the values with softmax or something), an activation value of less than zero still seems like a valid value. Therefore, intentionally zeroing out an activation strikes me as a “lossy” action to take.
Put another way, I would have thought that the time to force our activations into a particular range of valid values would be at the very final layer, when we call `sigmoid()` or `softmax()` or something else, so that the activations map to prediction probabilities (e.g. values that sum to 1, in the softmax case). At that point, it makes sense to me that we wouldn’t want a negative activation (how can you have a predicted probability that’s less than zero?). But until we get to that point, it strikes me as premature to zero out our activations, for the reasons I described.
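Part of why I'd expect negatives to be fine before the final layer: softmax itself has no problem with negative inputs, it just maps them to small positive probabilities. A quick sketch of what I mean (my own toy softmax, inputs chosen arbitrarily):

```python
import math

def softmax(xs):
    # Exponentiate each activation (always positive, even for
    # negative inputs), then normalize so the outputs sum to 1.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([-2.0, 0.5, 1.0])
print(probs)  # all positive, and they sum to 1 despite the negative input
```

So a negative activation arriving at the final layer wouldn't break anything, which is why zeroing negatives earlier feels unnecessary to me.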
I understand the necessity of non-linear layers in general: they’re needed for the model to satisfy the universal approximation theorem. Otherwise, a series of linear layers with no interleaving non-linearities could be reduced to a single linear layer, and with only one effective layer you’re not taking full advantage of the model’s ability to learn and improve through training. I’m just confused about how using ReLU isn’t considered “lossy”.
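For instance, here's the collapse I mean, sketched with NumPy (toy layer sizes and random weights of my own choosing): two stacked linear layers with no non-linearity between them compute exactly the same function as one linear layer with composed weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers applied back-to-back, no non-linearity in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layers = W2 @ (W1 @ x + b1) + b2

# ...is equivalent to a single linear layer with composed parameters.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```

That part I'm fine with; my question is only about why zeroing negatives is an acceptable way to break the linearity.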
I feel like there must be something I’m misunderstanding about what it means for an activation to have a negative value.