Why does ReLU work at all when the gradient = 0?

If a single gradient becomes 0, the nearly thousand other backpropagated units that have a nonzero gradient (actually 1) get affected by it, because when we compute dJ/dw = … = 1 * 1 * 0 * 1 * 1 * … * 1, the 0 is multiplied in and ruins everything (1 useless vote rules over hundreds of useful votes). How does ReLU actually overcome this?
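To make that concrete, here is a minimal PyTorch sketch (the shapes and numbers are just an illustration, not from anyone's actual network) of a single chain w1 → ReLU → w2, where the zero ReLU derivative really does wipe out the gradient for everything upstream along that one path:

```python
import torch

x = torch.tensor([1.0])
w1 = torch.tensor([-2.0], requires_grad=True)  # pre-activation w1*x is negative
w2 = torch.tensor([3.0], requires_grad=True)

h = torch.relu(w1 * x)   # h = 0, and d(ReLU)/d(input) = 0 at this point
y = w2 * h
y.backward()

print(w1.grad)  # tensor([0.]) -- the zero factor in the chain rule kills this path
print(w2.grad)  # tensor([0.]) -- because h itself is 0
```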

I know leaky ReLU is a solution to this problem, but I want to know how ReLU works if this is what actually happens so often.

Well, many batches are fed to the network over the course of training, so a single pass on which the gradient is zero does not mean the weights are going to be useless on future passes.
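As a rough sketch of that (a tiny random network of my own, not from any particular model), the set of hidden units that receive a zero gradient changes from input to input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

def per_unit_grad(batch):
    net.zero_grad()
    net(batch).sum().backward()
    # total gradient magnitude of each hidden unit's incoming weights
    return net[0].weight.grad.abs().sum(dim=1)

sample_a = torch.randn(1, 4)
sample_b = torch.randn(1, 4)

# Units whose pre-activation is negative for sample_a get exactly zero
# gradient on that pass, but many of them are active again on sample_b.
print(per_unit_grad(sample_a))
print(per_unit_grad(sample_b))
```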

I think we also need to consider whether we need to look at all of the activation values anyway. There are, for example, sparse neural networks, where the whole point is to reduce the total number of computations. I am still learning, but I believe an active area of research is how to decrease the total number of activations in order to decrease hardware requirements. In this regard we can actually look at leaky ReLU and wonder, “Why does it do better?”
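For what it's worth, here is a quick way to see that sparsity (random data, so the numbers are only indicative): ReLU zeroes out roughly half the activations, while leaky ReLU keeps essentially all of them nonzero.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
pre = nn.Linear(100, 100)(torch.randn(256, 100))   # random pre-activations

relu_out = torch.relu(pre)
leaky_out = F.leaky_relu(pre, negative_slope=0.01)

# Fraction of exactly-zero activations: this is the sparsity that
# efficient-inference work tries to exploit.
print((relu_out == 0).float().mean())    # ~0.5
print((leaky_out == 0).float().mean())   # ~0.0
```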

There is also a lot of addition inside a matrix multiply, so a 0 early on does not mean all downstream activations will be 0 as well.
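A small example of that point (made-up numbers): each downstream unit is a weighted *sum* over its inputs, so one dead activation only removes a single term, and the gradient still reaches every other unit on that layer.

```python
import torch

h = torch.tensor([0.0, 1.3, 0.7, 2.1], requires_grad=True)  # one dead unit
w = torch.randn(4)

y = (w * h).sum()   # same structure as one row of a matrix multiply
y.backward()

print(y)        # generally nonzero even though h[0] == 0
print(h.grad)   # all entries nonzero: gradient still flows along the other paths
```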

We are effectively not training a weight while its unit is not activating. Though if the unit doesn't activate often enough, it could be in trouble due to things like weight decay.
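To illustrate the weight-decay part with a toy update rule (the numbers are hypothetical; this is just plain SGD with L2 decay written out by hand): if a unit never fires, its incoming weight sees no data gradient, so decay shrinks it step after step.

```python
import torch

w = torch.tensor([0.5])
lr, weight_decay = 0.1, 0.01

for _ in range(1000):
    data_grad = torch.tensor([0.0])             # unit never activates -> no signal
    w = w - lr * (data_grad + weight_decay * w)

print(w)  # ~0.18 and still shrinking, with nothing to push it back up
```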
