As a preface, I know what gradients are. A gradient is a single value that tells us the slope of a function at a point. This fact can be used to minimize loss.

Below is my understanding of how gradient descent works.

Let’s use a quadratic as our loss function: f(x) = ax^2 + bx + c.

Now the minimum loss for this function would be at the vertex of this quadratic, where the gradient is zero. Therefore, our task would be to find the weights that drive the gradient to zero. And to drive the gradient to zero, we can use the gradient itself to adjust the weights.

To calculate the gradient, we can use the corresponding derivative function: f'(x) = 2ax + b.
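To make that concrete, here is a minimal sketch of the idea in plain Python. The coefficients, starting point, learning rate, and step count are all values picked for illustration, and `descend` is a hypothetical helper, not something from the course notebook.

```python
def f_prime(x, a, b):
    # Derivative of f(x) = a*x**2 + b*x + c with respect to x
    return 2 * a * x + b

def descend(x0, a, b, lr=0.1, steps=100):
    # Repeatedly step against the slope; lr and steps are illustrative choices
    x = x0
    for _ in range(steps):
        x -= lr * f_prime(x, a, b)
    return x

# Illustrative coefficients (assumed, not from the course)
a, b, c = 1.0, -4.0, 3.0
x_min = descend(0.0, a, b)

# The vertex of the quadratic is at x = -b / (2a), so the two numbers should agree closely
print(x_min, -b / (2 * a))
```

Here there is only one thing being adjusted (x), so there is only one slope to follow.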

However, in the chapter 04_mnist_basics, in the section Stochastic Gradient Descent (SGD), we use PyTorch to calculate the gradient using the backward method. This method computes three gradients: a gradient for the weight a, a gradient for the weight b, and a gradient for the weight c. This is what confuses me.

Again, the concept of a gradient that I know of is that it is a single value that tells us the slope of a function.

What do the gradients of a, b, and c represent? Each weight has its own gradient, and that isn’t making sense to me.

Ah so this is a good question. Had me stumped for a while too.

What you need to understand is that in the equation y = ax^2 + bx + c, the (x, y) are actually constants. This is due to the fact that we observe this data. Let’s say the loss for one instance (i) is \mathcal{L} = (y_i - (ax_i^2 + bx_i + c))^2, then when you do partial differentiation (i.e. keeping all else constant), you get:
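Spelled out, with r_i as shorthand for the residual of instance i (notation introduced here only for brevity), the three partial derivatives are:

$$
\begin{aligned}
r_i &= y_i - (a x_i^2 + b x_i + c) \\
\frac{\partial \mathcal{L}}{\partial a} &= -2\, x_i^2\, r_i \\
\frac{\partial \mathcal{L}}{\partial b} &= -2\, x_i\, r_i \\
\frac{\partial \mathcal{L}}{\partial c} &= -2\, r_i
\end{aligned}
$$

Each of these is a single number once you plug in the observed (x_i, y_i) and the current values of a, b, and c. Together they form the gradient of the loss with respect to the weights.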

Ooooh, this makes sense now! (I’m having one of those “Aha!” moments right now.)

After reading a bit online too, what I’m getting is that a partial derivative tells us how a change in one of the variables, or in this case weights, changes the function while keeping all other variables constant. Therefore, each variable/weight has its own gradient. I don’t completely understand, but I get the gist.
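One way to check that reading is to nudge one weight at a time and see how the loss moves. The sketch below does this for the single-instance squared-error loss from the reply above, in plain Python rather than PyTorch. The weight values and the data point are made up for illustration, and `analytic_grads` and `numeric_grads` are hypothetical helper names.

```python
def loss(a, b, c, x, y):
    # Squared error for one observed (x, y); a, b, c are the weights
    return (y - (a * x**2 + b * x + c)) ** 2

def analytic_grads(a, b, c, x, y):
    # Partial derivatives of the loss with respect to a, b, c
    r = y - (a * x**2 + b * x + c)
    return (-2 * x**2 * r, -2 * x * r, -2 * r)

def numeric_grads(a, b, c, x, y, eps=1e-6):
    # Central differences: change one weight, hold the other two fixed
    da = (loss(a + eps, b, c, x, y) - loss(a - eps, b, c, x, y)) / (2 * eps)
    db = (loss(a, b + eps, c, x, y) - loss(a, b - eps, c, x, y)) / (2 * eps)
    dc = (loss(a, b, c + eps, x, y) - loss(a, b, c - eps, x, y)) / (2 * eps)
    return (da, db, dc)

# Made-up weights and a made-up observation
a, b, c = 1.5, -0.5, 2.0
x, y = 3.0, 10.0

exact = analytic_grads(a, b, c, x, y)
approx = numeric_grads(a, b, c, x, y)
print(exact)
print(approx)
```

The two tuples should agree to several decimal places, with one entry per weight. As far as I understand it, these per-weight numbers are the same kind of thing PyTorch computes when you call backward.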

Not necessarily. It might be the case that the gradients are close to zero towards the end of training, but they will generally not be exactly zero. When you consider millions of weights, it’s hard to get all their gradients close to zero (and you might not want to). This will become more obvious when you see more examples through the course.
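A toy version of this can be seen even with three weights. The sketch below fits a, b, and c to five made-up points with plain gradient descent on the mean squared error; the data, learning rate, and step count are all assumptions chosen for illustration, not anything from the course.

```python
# Made-up observations (roughly quadratic, with a little noise added by hand)
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [8.6, 3.3, 0.55, -0.35, 0.4]

def mse(a, b, c):
    # Mean squared error over all observations
    return sum((y - (a * x**2 + b * x + c)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grads(a, b, c):
    # Average of the per-instance partial derivatives for each weight
    n = len(xs)
    ga = gb = gc = 0.0
    for x, y in zip(xs, ys):
        r = y - (a * x**2 + b * x + c)
        ga += -2 * x**2 * r / n
        gb += -2 * x * r / n
        gc += -2 * r / n
    return ga, gb, gc

a, b, c = 0.0, 0.0, 0.0
lr = 0.05  # illustrative learning rate
initial_loss = mse(a, b, c)

for step in range(200):
    ga, gb, gc = grads(a, b, c)
    a -= lr * ga
    b -= lr * gb
    c -= lr * gc

final_loss = mse(a, b, c)
final_grads = grads(a, b, c)
print(initial_loss, final_loss)
print(final_grads)
```

After a finite number of steps the loss has dropped and the gradients are small, but they have not landed exactly on zero. Because the points are noisy, the loss itself does not reach zero either.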

I’m not sure if this is related, but coincidentally I watched this video a couple of days ago, and the pattern-matching part of my mind senses a tentative link in the few minutes following the point where I jump in here: Taylor series | Chapter 11, Essence of calculus - YouTube

Jeremy, I’d be interested in your thoughts.

Even if it’s not directly related, overall it’s a cool tutorial about approximating functions.

Also, if you don’t mind me asking, seeing as I’m doing the 2020 course, would it be better to continue with the 2020 course or to switch to the 2022 course?