Understanding gradient calculation

Hi,
Can someone help me understand the gradient calculations here? How does the backward pass formula, which is given in many notes, fit in?

  1. What is inp.g? Do these calculations happen in place?

  2. Where does out.g get calculated?

    def lin_grad(inp, out, w, b):
        # grad of the matmul with respect to the input
        inp.g = out.g @ w.t()
        # grad with respect to the weights
        w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)
        # grad with respect to the biases
        b.g = out.g.sum(0)

  1. inp.g is just an attribute. With a PyTorch tensor, you can attach whatever attribute you want to it (inp.whatever will work as well). So .g is just used as a place to store gradients.
  2. out.g is calculated by the earlier _grad functions. The initial one is calculated in mse_grad as the gradient of the loss w.r.t. the last activation.
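
For reference, here's roughly how that chain starts (a sketch following the course notebook's convention; the exact code there may differ a bit):

    def mse_grad(inp, targ):
        # derivative of mean((inp - targ)^2) w.r.t. inp;
        # this seeds .g on the final output, which becomes
        # out.g for the last lin_grad call
        inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]
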
1 Like

Sorry for not answering your question, but I was wondering if a lot of people want to know the details of gradient calculations? If there’s enough interest, I’ll be happy to do a whiteboard session for people located in the Bay Area in which we’ll go through each of these gradient calculations in detail.

3 Likes

Jeremy mentioned a blog he had written along with a fellow professor at USF. It is explained.ai. That blog provides a damn good introduction to this topic.

1 Like

Great suggestion: I was just looking through that matrix calculus paper for my own review: https://explained.ai/matrix-calculus/index.htm

@champs.jaideep Sections 5 & 6 go over gradient calculations. Start from the top of the paper if you need to review the linear algebra and calculus (partial derivatives).

2 Likes

Thanks, I was looking to understand the math behind this … like the formula for any layer, dw = dz * input, which we see in various blogs etc.

The forward pass of a fully-connected layer computes the following:

 f[1] = w[1,1]*x[1] + w[2,1]*x[2] + ... + w[n,1]*x[n] + b[1]
 f[2] = w[1,2]*x[1] + w[2,2]*x[2] + ... + w[n,2]*x[n] + b[2]
 ...
 f[m] = w[1,m]*x[1] + w[2,m]*x[2] + ... + w[n,m]*x[n] + b[m]

where:

  • m is the number of neurons in that layer,
  • n is the number of inputs,
  • x is a vector of inputs (n in total),
  • w are the weights (a matrix of size n * m),
  • b are the biases (m in total),
  • and f is the output of the layer (also a vector of m elements).

(Note: I started counting at 1 here, not at 0. It doesn’t really matter.)
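
In code, all m of those formulas are just a single matrix product. Something like this sketch (the names and sizes are mine, for illustration):

    import torch

    n, m = 4, 3
    x = torch.randn(n)     # inputs
    w = torch.randn(n, m)  # weights
    b = torch.randn(m)     # biases
    f = x @ w + b          # computes all m formulas above at once
    print(f.shape)         # torch.Size([3])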

The backward pass takes an incoming gradient and computes three other gradients from that.

The incoming gradient is that of the loss L with respect to f, or dL/df. This is a vector of m numbers because this layer has m neurons and therefore m outputs.

The three gradients computed are:

  1. dL/dx – the loss w.r.t. the inputs of the layer
  2. dL/dw – the loss w.r.t. the weights of the layer
  3. dL/db – the loss w.r.t. the biases of the layer

Since there are n inputs, dL/dx is really a vector of n elements. dL/dw is a matrix of n * m elements, and dL/db is a vector of m elements.

To understand how to compute these vectors, it is useful to look at a single element at a time. For example, let’s look at dL/dx[1]. In the forward pass, the only terms that have x[1] in them are:

w[1,1]*x[1]   in f[1]
w[1,2]*x[1]   in f[2]
...
w[1,m]*x[1]   in f[m]

Since none of the other terms in the formula for f have x[1] in them, those terms become 0 when we compute the partial derivative w.r.t. x[1] (they are treated as constants and the derivative of a constant is 0).

Thanks to the chain rule, we know that dL/dx[1] = dL/df * df/dx[1]. We already know dL/df, which is a vector of m numbers (this is the incoming gradient).

So now we have to compute df/dx[1]. This describes by how much the value of f changes if x[1] becomes larger. Since f is a vector, let’s look at how x[1] affects each term of f:

df[1]/dx[1] = w[1,1]
df[2]/dx[1] = w[1,2]
...
df[m]/dx[1] = w[1,m]

I hope this makes sense, because f[1] = w[1,1]*x[1] + ... where ... is other stuff that is treated as constant when we take the derivative w.r.t. x[1].

In other words, if x[1] is incremented by 1, then the output of f[1] changes by the coefficient of x[1], which is w[1,1]. And so the derivative of f[1] w.r.t. x[1] is w[1,1]. Likewise for the other elements of f.

And also for df/dx[2] and the other elements of x:

df[1]/dx[2] = w[2,1]
df[2]/dx[2] = w[2,2]
...
df[m]/dx[2] = w[2,m]

And so on… you can see the pattern here.

Now we know the derivative of every output element of f w.r.t. every input element of x. We can put these into an m * n matrix. And if you hadn’t guessed yet, that is exactly the weight matrix again.

So df/dx is really the weight matrix w. That makes sense because the influence that a given input x[i] has on the output f, is described by the weights between the inputs and the hidden neurons.

Finally, to get dL/dx, we multiply dL/df – which if you recall is the incoming gradient vector of m values – by df/dx, which is really just the weight matrix. Multiplying a vector of size m with a matrix of size m*n gives a new vector of size n. Which makes sense because x has n elements, and so dL/dx should also have n elements.

Phew, that was a long explanation of this single line in the above code:

inp.g = out.g @ w.t()

Here, inp.g is dL/dx, out.g is dL/df, and w.t() is the weight matrix or df/dx. (It was explained in lesson 9 why you’re taking the transpose here. In short, it’s because the weights are not stored as an m * n matrix in PyTorch but as n * m, for historical reasons.)
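
To make the shapes concrete, here's a small sketch (with a batch dimension, and sizes chosen arbitrarily):

    import torch

    bs, n, m = 5, 4, 3
    out_g = torch.randn(bs, m)  # dL/df, the incoming gradient
    w = torch.randn(n, m)       # weights stored as n * m
    inp_g = out_g @ w.t()       # (bs, m) @ (m, n) -> (bs, n), i.e. dL/dx
    print(inp_g.shape)          # torch.Size([5, 4])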

So now that you know how to compute dL/dx, can you figure out the math to compute dL/dw and dL/db?

It’s the exact same method: for dL/dw you figure out how much a change in each weight value w[i,j] contributes to f. And for dL/db you figure out how much a change in each bias value b[i] contributes to f.
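
If you want to check your answers, one way is to compare the manual gradients against PyTorch’s autograd (a sketch; the setup here is mine):

    import torch

    bs, n, m = 5, 4, 3
    x = torch.randn(bs, n, requires_grad=True)
    w = torch.randn(n, m, requires_grad=True)
    b = torch.randn(m, requires_grad=True)

    out = x @ w + b
    out.retain_grad()          # keep dL/dout so we can seed the manual pass
    loss = out.pow(2).mean()
    loss.backward()

    # the manual backward pass from lin_grad, seeded with dL/df = out.grad
    x_g = out.grad @ w.t()
    w_g = (x.detach().unsqueeze(-1) * out.grad.unsqueeze(1)).sum(0)
    b_g = out.grad.sum(0)

    print(torch.allclose(x_g, x.grad))  # True
    print(torch.allclose(w_g, w.grad))  # True
    print(torch.allclose(b_g, b.grad))  # True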

18 Likes

[quote=“machinethink, post:7, topic:42770”]
Phew, that was a long explanation of this single line in the above code:
[/quote]

It was a really good one though! :slight_smile:

Thanks. :smiley:

One thing I didn’t mention is the effect of using mini-batches on this calculation. So for anyone following along with the math, note that the PyTorch code assumes there is a batch dimension in the data (and therefore also in the gradients). This affects the calculations for the weights and bias gradients because the same weight and bias values are used with multiple input vectors. So keep that in mind. :wink:
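
For example, the weight gradient line works on a whole batch at once (a sketch with arbitrary sizes):

    import torch

    bs, n, m = 5, 4, 3
    inp = torch.randn(bs, n)
    out_g = torch.randn(bs, m)

    # an outer product per example, then summed over the batch dimension
    w_g = (inp.unsqueeze(-1) * out_g.unsqueeze(1)).sum(0)
    print(w_g.shape)                             # torch.Size([4, 3]), same as w
    print(torch.allclose(w_g, inp.t() @ out_g))  # True: it's just a matmul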

One thing I can’t wrap my head around is how the gradients from a batch are aggregated in PyTorch. This post and reply didn’t really help me understand if gradients are summed or averaged. Maybe I’m missing something in that response, but it seems like they have to be one or the other. Anyone know?

2 Likes

The answer is kind of “averaged”, but really the answer is more like “neither”: you just calculate the loss and take gradients; there’s no need to explicitly average or sum anything.

In practice the losses we use are all averaged over the batch (so add up a little individual loss for each training example and then divide by the batch size), so yeah, in a sense the gradients are averaged over those little losses. But the semantic issue you might be getting tripped up on is that there’s just one overall loss. It’s true that its gradients are equivalent to averaging the gradients of its little constituent losses, but there’s no need to think of it that way: you just have some loss and you calculate its gradients, that’s it.
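
A quick way to see the equivalence (a sketch, not from the lesson code):

    import torch

    w = torch.randn(3, requires_grad=True)
    xs = torch.randn(4, 3)  # a batch of 4 examples
    ys = torch.randn(4)

    # one overall loss, averaged over the batch
    loss = ((xs @ w - ys) ** 2).mean()
    loss.backward()

    # its gradient equals the average of the per-example gradients
    per_example = [torch.autograd.grad((x @ w - y) ** 2, w)[0]
                   for x, y in zip(xs, ys)]
    print(torch.allclose(w.grad, torch.stack(per_example).mean(0)))  # True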

2 Likes

That’s right @cqfd . The loss returns a scalar, aka a rank-0 tensor. Most loss functions have a reduction param where you choose how the loss is reduced (mean/sum/none). So it’s not the gradients that are aggregated, but the loss.
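
For example, with the functional API:

    import torch
    import torch.nn.functional as F

    pred, targ = torch.randn(5, 1), torch.randn(5, 1)
    print(F.mse_loss(pred, targ, reduction='mean'))        # scalar (the default)
    print(F.mse_loss(pred, targ, reduction='sum'))         # scalar, summed
    print(F.mse_loss(pred, targ, reduction='none').shape)  # per-element, (5, 1)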

3 Likes

Yep, that makes sense. Not sure why I was conceptualizing it as a little part of the gradient being associated with the loss from each observation. Thanks!

Hello,

I can’t figure out why there is a sum(0) in the gradient calculation for b. I see that we need it to get the dimensions of b correct, but math-wise I can’t wrap my head around it.

The gradient of the loss w.r.t. the biases is as follows:

dL/db[1] = dL/df * df/db[1]
dL/db[2] = dL/df * df/db[2]
...
dL/db[m] = dL/df * df/db[m]

Recall that this layer has m outputs, so dL/db will be a vector of m numbers too.

To find df/db[1], look at which parts of f contain b[1]. You’ll find that only f[1] does anything with b[1]:

f[1] = w[1,1]*x[1] + w[2,1]*x[2] + ... + w[n,1]*x[n] + b[1]

So to compute the partial derivative df/db[1], we only need to care about f[1].

In f[1], the terms that do not involve b[1] are treated as constants (because we’re taking the partial derivative) and therefore become 0. The term with b[1] becomes 1.

And so we get df[1]/db[1] = 1.

(If this seems confusing, think of the math function g(x) = a + x. What is the derivative dg/dx? The a is constant and becomes 0, the x becomes 1. The same thing happens here except the variable is not called x but b[1].)

But let’s not forget about the rest of f either. We have df[2]/db[1] = 0, and also df[3]/db[1] = 0, and so on. The formulas for f[2], f[3], etc. do not use b[1] and so their derivative w.r.t. b[1] is 0.

Putting this all together gives df/db[1] = [1, 0, 0, ..., 0]. Because f is a vector of m elements, so is df/db[1].

The same thing is true for all the other df/db[i]. For example, df/db[2] = [0, 1, 0, 0, ..., 0].

You can put all of df/db into a matrix of m * m elements that is all zeros except on the diagonal. In other words, it is an identity matrix.

Finally, we go back to our original formula:

dL/db = dL/df * df/db

We already know dL/df, which is a vector of m numbers (this is the incoming gradient). So here we could do a matrix multiply between an m-element vector and an m * m matrix to get a new m-element vector. But we don’t actually need to do this since df/db is an identity matrix, and multiplying with an identity matrix doesn’t change anything.

Long story short, dL/db = dL/df. The gradient of the loss w.r.t. the biases is the exact same as the incoming gradient.

This makes sense, because making one of the bias values larger by 1 will also make the corresponding output larger by 1. If you change b[i] by some amount z, then f[i] also changes by z.

So why is there a sum()? Recall that we don’t do the above calculations for a single example but for an entire batch of examples. The gradient is computed over the entire batch, which is done by summing everything together.

Consider that for a loss such as MSE we actually compute the following:

L = [MSE(f(x_1), y_1) + MSE(f(x_2), y_2) + ... + MSE(f(x_bs), y_bs)] / bs

where x_1 is the first example in the batch, etc. bs is the batch size. L is the total loss, which is the mean of the MSE for each example in the batch.

The gradient of the loss L w.r.t. the model output f, i.e. dL/df, is actually for the entire batch. Previously I said this is a vector of m elements but it’s really a matrix of bs * m elements because there are bs examples in the batch.

If you work out the math, you’ll see that each example in the batch uses the same b[1], the same b[2], and so on. But each example will have a different incoming gradient dL/df. Because we want to update b with the gradients from all the examples, and we have bs different gradients, we need to sum them up (because in the formula for L they get summed up too).
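
In code, that per-batch sum is exactly what the sum(0) does (a sketch with arbitrary sizes):

    import torch

    bs, m = 5, 3
    out_g = torch.randn(bs, m)  # dL/df for the whole batch, one row per example
    b_g = out_g.sum(0)          # dL/db: sum the per-example gradients
    print(b_g.shape)            # torch.Size([3]), same shape as b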

4 Likes

Thank you for your great explanation. When you were explaining the inp.g gradients in an earlier post, you said that df/dx is the weight matrix and that we use the transpose because of PyTorch. Are you sure that’s the case here? I don’t think that has anything to do with PyTorch. The derivation gives the weight matrix transposed.
Am I right?

It really just depends on how you decide to store the weights in the matrix.

Great explanation here… I have a question though: why are we calculating the gradients for the input (x_train) in the first linear layer in the backward pass? (inp.g = out.g @ w.t()) We are not going to adjust the inputs, I think, and we are not actually using the gradients for the inputs anywhere that I can see. We later adjust just the weights based on the gradients in the training loop (in lesson 9), but we never use the gradients from the input matrix, do we?

If you’re not using the gradients, you don’t have to compute them. :smiley:

(In something like neural style transfer, you do want the gradients for the input image, because there you’re optimizing the image’s pixels. And in that case you don’t need the gradients for the convolution weights, so you wouldn’t compute those. So it just depends on what you want to do.)
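
A toy sketch of that idea (everything here is made up for illustration): the input itself is the thing being optimized, so it’s the input that gets requires_grad.

    import torch

    img = torch.randn(1, 3, 8, 8, requires_grad=True)  # we optimize the pixels
    target = torch.zeros(1, 3, 8, 8)
    opt = torch.optim.SGD([img], lr=0.1)

    for _ in range(10):
        loss = (img - target).pow(2).mean()
        opt.zero_grad()
        loss.backward()  # the gradient flows to img, the "input"
        opt.step()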

Just to chime in on this thread: I found this explanation really clarified things for me – I think because it included concrete values.

This is testament to how basic my level of understanding is, but:

Based on that page’s example, I was amazed that the gradient really was just x's multiplier (y), so I made a Google Sheet to prove it.

In the sheet, x increases by a tiny amount each time – while the other variables stay the same – then the effect that this has on the outcome of g is recorded, and – lo and behold – for every tiny increase in x, g changes by y's value (-4)!

(Note: the “G step” column says 0 because I think Google Sheets is bad at recording the tiny change in g's value).
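
You can do the same experiment in a couple of lines of Python (a hypothetical recreation; I’m guessing at the exact function the sheet used, but any function where y multiplies x behaves this way):

    # hypothetical function where y multiplies x, so dg/dx = y
    def g(x, y, z):
        return (x + z) * y

    y, z = -4.0, 3.0
    x, eps = 2.0, 1e-6
    print((g(x + eps, y, z) - g(x, y, z)) / eps)  # ~ -4.0, i.e. y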

2 Likes