If a single gradient becomes 0, nearly a thousand other backpropagated units that have a nonzero gradient (actually 1) are affected by it, because when we compute dJ/dw = … = 1 × 1 × 0 × 1 × 1 × 1 × 1 × … × 1, the 0 is multiplied in and ruins everything (one useless vote overrules hundreds of useful votes). How does ReLU actually overcome this?
I know leaky ReLU is a solution to this problem, but I want to know how plain ReLU works at all if this is what actually happens so often.
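To make the arithmetic in the question concrete, here is a minimal sketch of the product I am describing. The pre-activation values and the names `relu_grad`, `path_a` and `path_b` are made up for illustration, and the weight factors are left out (treated as 1), as in the product above. It also shows the one thing I am fairly sure of: when a layer has more than one unit, the chain rule adds up the contributions of the different paths, so a 0 on one path only removes that path's term.

```python
from math import prod


def relu_grad(z):
    """Derivative of ReLU at pre-activation z (taken as 0 for z <= 0)."""
    return 1.0 if z > 0 else 0.0


# Hypothetical pre-activations along two backpropagation paths.
path_a = [0.7, 1.2, -0.3, 0.5, 2.0]  # one unit is inactive (z <= 0)
path_b = [0.4, 0.9, 1.1, 0.2, 0.6]   # every unit is active

# Product of local ReLU derivatives along each path (weights ignored).
grad_a = prod(relu_grad(z) for z in path_a)  # contains a 0 factor
grad_b = prod(relu_grad(z) for z in path_b)  # all factors are 1

# The chain rule sums over paths, so the zero path drops out of the sum
# instead of multiplying the other path by 0.
total = grad_a + grad_b

print(grad_a, grad_b, total)  # prints: 0.0 1.0 1.0
```

So the single 0 does kill `path_a`, which is exactly the effect I am asking about, but in this toy setup it does not touch `path_b`.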