Okay. I was just thinking out loud with the limited stuff I have been learning and understanding.

Just wanted to say that this comment has been a real eye-opener. Thank you!

Yeah, I think that is what we are doing here, and we can't all be right all the time, or else no learning would be happening. So I guess we are on the right track here.

Cool question from you, imo, and an insightful reply from @jeremy.

BTW, this led me to thinking - we could possibly use the norm of the gradient to figure out if we have converged, as opposed to staring at the loss and wondering whether we are there yet. In reality, using the loss is probably still better than using a proxy… but then maybe not necessarily? Maybe with adaptive learning rates the gradient, which is an accumulation over many batches, carries some interesting information that the train loss on a given batch doesn't contain?
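To make the idea concrete, here is a minimal sketch of that criterion - plain Python on a toy quadratic loss rather than a real network, with made-up values for the learning rate and tolerance - where training stops when the gradient norm drops below a threshold instead of watching the loss:

```python
import math

def grad_norm_descent(target, lr=0.1, tol=1e-6, max_steps=1000):
    """Gradient descent on f(w) = sum((w_i - t_i)^2), stopping when the
    gradient norm (not the loss) falls below `tol`."""
    w = [0.0] * len(target)
    gnorm = float("inf")
    for step in range(max_steps):
        # gradient of (w_i - t_i)^2 with respect to w_i is 2 * (w_i - t_i)
        grad = [2.0 * (wi - ti) for wi, ti in zip(w, target)]
        gnorm = math.sqrt(sum(g * g for g in grad))
        if gnorm < tol:  # converged by the gradient-norm criterion
            return w, step, gnorm
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w, max_steps, gnorm

w, steps, gnorm = grad_norm_descent([1.0, -2.0])
```

On a convex bowl like this the loss and the gradient norm agree about convergence, so this only illustrates the mechanics - the interesting question in this thread is precisely that in deep nets the two can disagree.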

Well, sorry, don't mean to get ahead of myself, but through the question that you asked, @ravivijay, I definitely learned something new, so thanks for asking!

Cool. My question stemmed from a recent video from Ian Goodfellow, https://www.youtube.com/watch?v=XlYD8jn1ayE&t=5m40s - at 5m40s he says that even though the loss might go down, the gradient norm may go up. He says it's fine and expected in DNNs, but I thought it might be worth digging into more while relating it to this question.

Thank you @ravivijay! This is an amazing talk. I came across it linked on Twitter, but the person only linked to the part on numerical stability (which, btw, is a great 20 minutes to invest in watching when you want to start using pytorch and haven't done low-level computations in quite a while - I had quite a few bugs in my first pytorch implementation that were ridiculous once spotted, and this talk definitely would have helped!)

With regards to the part that you link to - wow! I do not think Goodfellow ever explains why the gradient doesn't start approaching 0. I mean, I get that we do not reach a minimum in the mathematical sense, but I would expect the gradient to at least start falling a bit! Instead, the gradient magnitude seems to grow as training progresses and then levels off.
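For what it's worth, loss going down while the gradient norm goes up is not a contradiction even in one dimension - any loss surface that gets steeper as it gets lower shows it. A deliberately contrived toy (this is just an illustration of the coexistence, not a claim about what is actually happening inside a DNN):

```python
import math

# Toy 1-D loss that gets *steeper* as it gets lower: f(x) = sqrt(x) for x > 0.
# Under plain gradient descent the loss falls on every step while |f'(x)| keeps growing.
def f(x):
    return math.sqrt(x)

def df(x):
    return 1.0 / (2.0 * math.sqrt(x))

x, lr = 1.0, 0.01
losses, gnorms = [], []
for _ in range(100):
    losses.append(f(x))
    gnorms.append(abs(df(x)))
    x -= lr * df(x)  # x shrinks toward 0, so the loss drops but the slope steepens
```

After 100 steps the recorded losses are strictly decreasing while the recorded gradient magnitudes are strictly increasing - exactly the pattern from the talk, produced by nothing more exotic than curvature.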

This is very interesting. I think this effect of the gradient norm not falling also has to do with the training algorithm used, but it definitely raises interesting questions about the weight space. I wonder what would happen on simpler examples, or whether where we arrive in the weight space relates to what happens to the gradient norm. And what would happen if I continued to train but with a smaller learning rate?

Through @jeremy's comments, which left me thinking about this quite a bit, I started working on a continuation to the first Medium post. Here are a couple of relevant sentences that I would not have been able to write a week ago - a distinction that I think I am only now able to make:

What is a minimum? From a mathematical perspective, this is a well-defined term. But from the perspective of a practitioner, a minimum is a spot in the weight space that our training algorithm reaches when the network is fully trained and is unable to significantly improve upon via further training.

Have a couple of experiments in mind but they are probably a couple of weeks away - still not even done with lecture 1 to the extent that I would like to be.

Anyhow - sorry for rambling. What you linked to is very relevant and very interesting. Thank you.

Interesting comments:

"What is a minimum? From a mathematical perspective, this is a well-defined term. But from the perspective of a practitioner, a minimum is a spot in the weight space that our training algorithm reaches when the network is fully trained and is unable to significantly improve upon via further training."

My high-level understanding is this: the NN can learn well enough, even at a local minimum or in a narrow region of the weight space, to satisfy the training/validation loss for the classification. The accuracy does not represent the absolute optimum for the network; it represents a number that satisfies the engineer. To solve a classification task, one can be fine with a system that is good for the task and not care whether it is super optimal in the math space. It is in researchers' interest to chase how to reach the optimum, or understand where it lies, as that would be most efficient in terms of time/resources and the robustness of getting a good solution. The question is whether the gradient norm can be seen as one of the numbers for judging optimality.

I also agree that we should keep this floating but not get distracted from practicing enough of what @jeremy teaches us and build intuitions as practitioners.

Thanks for posting this question! Learned a lot from this thread!

Appreciate it!