Are vanishing gradients a sign of good training? [Lesson 10]

TL;DR: In my tests, the better model at the end of Lesson 10 ends up with smaller gradients than the worse model from earlier in the lesson. Please help me understand why this is the case.

In Lesson 10 at about 1:14:53 Jeremy says that the uninitialized model which uses standard ReLU has difficulty training in the first iterations and then slowly starts training. He goes on to say that he worries about many parameters of the model never getting back into a reasonable place and “maybe the vast majority of them have like zero gradients”.

He then goes on to show how the Kaiming-initialized model with GeneralRelu has better performance and fewer activations close to zero in the different layers, which seems to indirectly prove his hypothesis. However, we never get to see the actual gradients of each layer during training.

So I tried (and maybe failed; please tell me if I did something wrong) to visualize the gradients during training. You can find my modified Lesson 10 notebook in this gist.
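For context, here is a minimal sketch of the kind of stats collection I mean. The names `collect_grad_stats` and `grad_stats` are just placeholders for illustration, not the exact code from my gist, and it assumes a standard PyTorch training loop:

```python
import torch

def collect_grad_stats(model, stats):
    # stats is a dict mapping parameter name -> list of (mean, median) per step
    for name, p in model.named_parameters():
        if p.grad is not None:
            g = p.grad.detach().abs()
            stats.setdefault(name, []).append(
                (g.mean().item(), g.median().item())
            )

# inside the training loop, after loss.backward() and before opt.step():
#     collect_grad_stats(model, grad_stats)
```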

If I collected and interpreted the data correctly, the better model with GeneralRelu and Kaiming init is actually getting smaller average (mean) and median gradients faster than the uninitialized, regular ReLU model. This seems to me to be the opposite of what Jeremy said (but maybe not meant?) in the course: the gradients of the better model are going towards zero faster than those of the worse model.

My interpretation of this is that the gradients of the better model can “afford” to be smaller sooner, because its weights are in a very sensible place very early on, i.e. the model has already found a pretty good solution in the function space.

If this is correct, why did Jeremy worry about the gradients being zero? Are zero gradients a bad thing in general or only in certain situations?

Any help is much appreciated! Thanks in advance.

Hi,

My comment is about the later statements/questions in your post. I didn’t go through lesson 10 of 2019 and have no clue about the exact content.
The way I understood it is that really small gradients are a problem during backpropagation because they become smaller and smaller as they propagate backwards through the network. This means that the earlier layers are less and less likely to adjust. So I suppose it would lead to suboptimal performance, because maybe those earlier layers should have been tuned. My guess is that this could be a reason why Jeremy was worried. Vanishing gradients were a big problem for networks with many fully connected layers, to the point that bigger networks performed worse than smaller ones.
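To make the “earlier layers adjust less and less” part concrete, here is a tiny standalone toy example (my own sketch, not from the lesson): a deep stack of sigmoid layers where the gradient reaching the first layer is orders of magnitude smaller than the one in the last layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 20 Linear+Sigmoid blocks with default init -> classic vanishing-gradient setup
layers = []
for _ in range(20):
    layers += [nn.Linear(100, 100), nn.Sigmoid()]
model = nn.Sequential(*layers)

x = torch.randn(64, 100)
loss = model(x).pow(2).mean()
loss.backward()

first = model[0].weight.grad.abs().mean().item()   # first Linear layer
last = model[-2].weight.grad.abs().mean().item()   # last Linear layer
print(f"first layer grad: {first:.2e}, last layer grad: {last:.2e}")
```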
Regarding your interpretation: maybe the model works okay because the weight init is good to begin with. And/or if you use a pre-trained network, then it might be that the problem it was trained to solve is similar to your problem, so not a lot of training is needed.

Hope this reply clarifies the vanishing gradient problem a bit :slight_smile:

Thanks for your reply. My understanding is basically what you said, but it doesn’t fit exactly with the results I am seeing. My current thinking is that small gradients aren’t generally bad, just bad when your model is still pretty bad and would need to adjust a lot, but cannot because of the small gradients. In the notebook I shared above, the better-performing and well-initialized model had smaller gradients earlier in training than the model which wasn’t initialized well and performed worse.


@davidpfahler I agree with your assessment: vanishing gradients aren’t always bad, only when they prevent the model from training until it finds optimal weights.

For a properly trained model, the weight updates become smaller and smaller as the model converges and the parameters approach their optimal values.

After all, the goal of training is to iteratively update the weights and biases so as to drive the loss function toward its minimum value, where its gradients with respect to the weights and biases vanish!
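Here is a minimal toy example (just an illustration I made up, not from the course notebook) showing that shrinking gradients are exactly what convergence looks like on a simple quadratic loss:

```python
import torch

# one parameter, quadratic loss with minimum at w = 2
w = torch.tensor([5.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

for step in range(50):
    loss = (w - 2.0).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    if step % 10 == 0:
        # both the loss and the gradient shrink towards zero as we converge
        print(f"step {step:2d}  loss {loss.item():.4f}  grad {w.grad.abs().item():.4f}")
    opt.step()
```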