[Third Attempt] What do the gradients in gradient descent represent?


As a preface, I know what gradients are: a gradient is a single value that tells us the slope of a function at a point, and this fact can be used to minimize loss.

Below is my understanding of how gradient descent works.

Let’s use a quadratic as our loss function: f(x) = ax^{2} + bx + c .

Now the minimum loss for this function would be at the vertex of this quadratic, and the gradient at the vertex is zero. Therefore, our task would be to find the weights that drive the gradient to zero, and to do that we can use the gradient itself to adjust the weights.

To calculate the gradient, we can use the corresponding derivative function: f'(x) = 2ax + b .
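Here is my understanding as a toy sketch in code (the values of a, b, c, the learning rate, and the step count are arbitrary choices for illustration, not from the book):

```python
# Minimize f(x) = a*x**2 + b*x + c by repeatedly stepping x
# against the derivative f'(x) = 2*a*x + b.
a, b, c = 1.0, -4.0, 7.0   # f(x) = x^2 - 4x + 7, vertex at x = 2
x = 0.0                    # arbitrary starting point
lr = 0.1                   # learning rate

for _ in range(100):
    grad = 2 * a * x + b   # f'(x), the slope at the current x
    x -= lr * grad         # step downhill

print(x)                   # converges toward the vertex at x = 2
```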

However, in the chapter 04_mnist_basics, in the section Stochastic Gradient Descent (SGD), we use PyTorch to calculate the gradients with the backward method. This gives us three gradients: one for the weight a, one for the weight b, and one for the weight c. I’m getting confused over this.

Again, the concept of a gradient that I know of is a single value that tells us the slope of a function.

What do the gradients of a, b, and c represent? Each weight has its own gradient, and that isn’t making sense to me.

I would highly appreciate clarification on this!


Ah so this is a good question. Had me stumped for a while too.

What you need to understand is that in the equation y = ax^2 + bx + c, the (x, y) pairs are actually constants, because they are observed data. Let’s say the loss for one instance (i) is \mathcal{L} = (y_i - (ax_i^2 + bx_i + c))^2. Then when you do partial differentiation (i.e. keeping all else constant), you get:

\begin{align} \frac{\partial \mathcal{L}}{\partial a}=2 (y_i - (ax_i^2 + bx_i + c))(-x_i^2) \end{align}

You can repeat it for b and c.
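A quick sketch checking this formula against PyTorch’s autograd (the data point and the initial a, b, c values here are made up for illustration, not from the course notebook):

```python
import torch

# One observed (x, y) pair — these are constants, not parameters
x_i, y_i = torch.tensor(3.0), torch.tensor(10.0)

# The weights are what we differentiate with respect to
a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)
c = torch.tensor(0.5, requires_grad=True)

loss = (y_i - (a * x_i**2 + b * x_i + c)) ** 2
loss.backward()   # fills in a.grad, b.grad, c.grad

# Analytical dL/da = 2 * (y_i - (a*x_i^2 + b*x_i + c)) * (-x_i^2)
manual = 2 * (y_i - (a * x_i**2 + b * x_i + c)) * (-x_i**2)
print(a.grad, manual)   # the two values should match
```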


Thank you for the response!

Ooooh, this makes sense now! (I’m having one of those “Aha!” moments right now :smile:).

After reading a bit online too, what I’m getting is that a partial derivative tells us how a change in one of the variables, or in this case weights, changes the function while keeping all other variables constant. Therefore, each variable/weight has a gradient. I don’t completely understand, but I get the gist.
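A tiny numerical illustration of that idea: nudge ONE weight while holding the others fixed and see how the loss changes (the loss function and the weight values here are made up for illustration):

```python
def loss(a, b, c, x=3.0, y=10.0):
    """Squared-error loss for one (x, y) data point."""
    return (y - (a * x**2 + b * x + c)) ** 2

a, b, c = 1.0, 2.0, 0.5
eps = 1e-6

# dL/da: change only a; b and c stay constant
dL_da = (loss(a + eps, b, c) - loss(a, b, c)) / eps
print(dL_da)   # close to the analytical value 2*(y - pred)*(-x^2) = 99
```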

If anyone else comes across this, this Khan Academy article provides a good rundown, provided you know about regular derivatives already: Introduction to partial derivatives (article) | Khan Academy

Thank you again for the response!

I would like to clarify one more thing: we still need to drive the gradient of each weight to 0, right?

In the new course we do this interactively - it might give you a better intuition.

Not necessarily. It might be the case that the gradients are close to zero towards the end of training, but definitely not equal to zero. When you consider millions of weights, it’s hard to get all their gradients close to zero (and you might not want to). This will become more obvious when you see more examples through the course.
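To see this concretely, here’s a small sketch (the data, initialization, learning rate, and step count are all arbitrary choices) that fits a, b, c to noisy quadratic data with gradient descent — the final gradients are small but not exactly zero:

```python
import torch

torch.manual_seed(0)
xs = torch.linspace(-2, 2, 20)
ys = 3 * xs**2 - xs + 1 + 0.1 * torch.randn(20)   # noisy quadratic data

params = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)  # a, b, c

for _ in range(200):
    a, b, c = params
    loss = ((ys - (a * xs**2 + b * xs + c)) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        params -= 0.05 * params.grad   # gradient descent step
        params.grad.zero_()

# One more backward pass to inspect the gradients after training
a, b, c = params
loss = ((ys - (a * xs**2 + b * xs + c)) ** 2).mean()
loss.backward()
print(params.grad)   # small, but not exactly zero
```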


I’m not sure if this is related, but coincidentally I watched this video a couple of days ago, and part of my mind’s pattern matching senses a tentative link to the few minutes following where I jumped in here… Taylor series | Chapter 11, Essence of calculus - YouTube

Jeremy, I’d be interested in your thoughts.

Even if it’s not directly related, overall it’s a cool tutorial about approximating functions.

And since @sachinruk mentioned partial differentiation, anyone unfamiliar might find this interesting… But what is a partial differential equation? | DE2 - YouTube

Ah, okay. I’ll have a look at the new course too.

Also, if you don’t mind me asking, seeing as I’m doing the 2020 course, would it be better to continue with it or to switch to the 2022 course?

Definitely change to 2022.
