From the book chapters, I know that ReLU takes the max of a number and 0. Does that imply that an activation which is less than zero will never make its way from the output of one linear layer to the input of the next?
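For concreteness, here's what I mean (a quick numpy sketch I put together myself, not from the book):

```python
import numpy as np

def relu(x):
    # ReLU: element-wise max(x, 0); every negative value becomes exactly 0
    return np.maximum(x, 0)

acts = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(acts))  # negatives are gone; only 0.5 and 2.0 survive unchanged
```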
If so, it seems surprising that ReLU would be a safe choice for a non-linear layer. If our activations haven’t yet reached the model’s final layer (where we’d normalize all the values with softmax or something), an activation value of less than zero still seems like a valid value. Therefore, intentionally zeroing out an activation strikes me as a “lossy” action to take.
Put another way, I would have thought that the time to ensure our activations fall within a certain range of valid values would be at the very final layer, when we call “sigmoid()” or “softmax()” or something else, in order to ensure that the activations map to prediction percentages which sum up to 1. At that point, it makes sense to me that we wouldn’t want a negative activation (because how can you have a prediction which is less than zero?). But until we get to that point, it strikes me as premature to zero out our activations, for the reasons I described.
I understand the necessity of non-linear layers in general, i.e. they’re needed to help our model conform to the universal approximation theorem (otherwise, a series of linear layers with no interleaving non-linearities could be reduced into a single linear layer, and if you only have one layer then you’re not taking full advantage of a model’s ability to learn and improve through training). I’m just confused on how using ReLU isn’t considered “lossy”.
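To make sure I have that part right, here's a quick numpy sketch of the "collapsing" argument as I understand it (my own illustration, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers with no non-linearity between them...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...are exactly equivalent to one linear layer with combined weights/bias
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True
```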
I feel like there must be something I’m misunderstanding about what it means when an activation has a negative value.
ReLU isn’t used to ensure activations fall within a certain range; it’s an activation function used to introduce non-linearity into the model. (While it does force values into the range 0 to +inf, that’s not its main purpose or why it’s put there.)
Well yes, but remember we aren’t really focused on the activations and their values, but rather on the parameters of our neurons. The model will keep altering the values of the weights and biases until it finds the most useful ones.
So a neuron might output a zero (after getting clipped by a ReLU) but the next neuron’s parameters might make it output a positive value next time and so on.
Another angle on @jimmiemunyi 's fine reply is that a given ReLU activation will not be zero for all input samples. It will be zero for some and non-zero for others, therefore it performs a useful function in approximating the training set.
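To illustrate (a toy numpy sketch with made-up weights, just for the idea):

```python
import numpy as np

# one hypothetical neuron: weights w, bias b
w, b = np.array([1.0, -1.0]), 0.0

samples = np.array([[2.0, 1.0],   # pre-activation  2 - 1 = +1 -> passes through
                    [1.0, 3.0]])  # pre-activation  1 - 3 = -2 -> clipped to 0
pre = samples @ w + b
post = np.maximum(pre, 0)
print(post)  # non-zero for the first sample, zero for the second
```

The same unit is zero for one input and active for another, which is exactly the kind of input-dependent switching that makes ReLU networks expressive.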
If an activation can be shown always to be zero for every conceivable input, the unit can be removed in theory. There’s a whole literature about this, filed under “pruning”.
Thanks @jimmiemunyi. The animation didn’t appear to have any audio, and I don’t have a math background and wasn’t able to deduce what was happening just from the visuals. The intended audience might be a smarter or more math-savvy person than me haha.
That said, the book that the website advertises looks useful. I might check it out after I finish the FastAI course if I still have confusion around this topic. Thanks for the reply.
Not wanting to make you paranoid, but I’d like to point out the dying ReLU problem, an extension of your question. Essentially, if many activations are zero (i.e. before applying ReLU, they were negative), there’d be “dead” neurons in your network, thereby preventing it from properly learning, and since the gradient of ReLU when x < 0 is 0, the effects are likely irreversible. Leaky ReLU aims to fix that by introducing a small slope for negative values, but it is not always such a great option and gives mixed results (see the Stanford course notes for more info on the pros & cons of various activation functions).
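In case it helps, a small numpy sketch of the two functions and why the gradient matters (the helper names are mine):

```python
import numpy as np

def relu_grad(x):
    # derivative of ReLU: 1 where x > 0, 0 where x < 0
    # so no gradient ever flows back through a negative pre-activation
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    # small slope alpha for negative inputs keeps a non-zero gradient there
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -1.0, 2.0])
print(relu_grad(x))    # zero gradient for both negative inputs
print(leaky_relu(x))   # negatives survive, scaled down by alpha
```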
Based on my experience, however, you’re unlikely to encounter this issue as long as you’re not training with too large a learning rate, and the network’s weights are being properly initialized. To become more familiar with this, you can train different networks with different settings (e.g. very large learning rates, random initialization, etc.) and plot their activations to check which ones are mostly dead, and how that affects their score.
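A rough sketch of that diagnostic, assuming you’ve collected post-ReLU activations into an (n_samples, n_units) array (the helper name `dead_fraction` is just mine):

```python
import numpy as np

def dead_fraction(activations, eps=0.0):
    # fraction of units whose post-ReLU output is zero for *every* sample;
    # activations: (n_samples, n_units) array of post-ReLU values
    dead = (activations <= eps).all(axis=0)
    return dead.mean()

acts = np.array([[0.0, 1.2, 0.0],
                 [0.0, 0.0, 0.3]])
print(dead_fraction(acts))  # only the first unit is zero on every sample
```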
So, does this mean that we don’t really care what the values of the activations are in the intermediate layers, only what they are after the final layer? i.e. while they’re making their way through the model, they lose any real-world meaning the original inputs or eventual outputs might have?
If you’ll indulge me with a metaphor: this process seems similar to an assembly line for a car factory. While the inputs are going through the line, the car is half-finished and is not drivable or usable at all. It might not even be recognizable as something that will eventually become a car. But by the time it comes off the line, it looks and drives like a car. We don’t care what the car looks like when it’s 50% of the way through the line, only what it looks like when it comes off the line. Similarly, we don’t care what the activation values are 50% of the way through the model, because they don’t necessarily “look like” what our final predictions will look like. In the case of image recognition, they might look like edges, or gradients, or odd shapes, etc. that the layers eventually learn to detect.
In this metaphor, our neural network parameters are represented by things like how much paint is applied to the exterior, how many stitches are used to sew the upholstery in the interior, how tightly the bolts or screws on the tires are rotated, etc. In other words, all of the many variables that go into producing that final car, and which determine whether or not the car is something we’d want to buy.
Thanks for humoring me lol. I’ve found that my brain learns difficult concepts best if metaphors are used. Is the above accurate, or are there holes in the logic above?
I think you’re right that the values of activations in the intermediate layers do not have interpretable meaning similar to the final layer activations. The network just does what it needs to do with them to optimize the loss.