I have a question regarding the `update()` function defined in lesson 2:

```
def update():
    y_hat = x@a
    loss = mse(y, y_hat)
    if t % 10 == 0: print(loss)
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)
        a.grad.zero_()
```

Having seen the update rule for linear regression a couple of times from a mathematical perspective, I am a little surprised by how this works, especially the `loss.backward()` call. The update rule looks something like this:

(θ are the parameters, J is the loss function and h_θ(x) is the `y_hat`)
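For reference, the update rule is usually written along these lines. This is a sketch of the standard textbook form only: the learning rate $\alpha$, the number of examples $m$ and the $\frac{1}{2m}$ scaling are assumptions that are not stated in this post.

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
$$

$$
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
= \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
$$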

If you did this mathematically, you would generally first compute the gradient (the partial derivatives) of the loss function with respect to the parameters. Once you have their functional form, you can plug in your `y` (labels), `a` (parameter estimates) and `x` (feature) values.

If I am reading the code correctly, they first compute the loss, which is mathematically just a scalar, and from that scalar they are still able to compute the gradient…

I guess it has something to do with the fact that what is returned from `mse()` is actually not a plain scalar but a tensor, which apparently stores everything that went into it (e.g. `y`, `a` and `x`) and is somehow still able to compute the derivatives with respect to `a` correctly.
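A small experiment seems to support this guess. It is only a sketch under assumptions: the data below is a made-up stand-in for the lesson's `x` and `y`, and `mse` is assumed to be the plain mean of squared differences. It inspects what `mse()` returns and what that tensor remembers about how it was computed.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in data: 100 rows, one feature column plus a column of ones.
n = 100
x = torch.ones(n, 2)
x[:, 0].uniform_(-1.0, 1.0)
y = x @ torch.tensor([3.0, 2.0]) + torch.rand(n)

# Assumed definition of the loss: mean of squared differences.
def mse(y, y_hat):
    return ((y - y_hat) ** 2).mean()

# requires_grad=True asks PyTorch to record every operation involving `a`.
a = torch.tensor([-1.0, 1.0], requires_grad=True)
loss = mse(y, x @ a)

print(loss.dim())                   # number of dimensions of the loss tensor
print(loss.grad_fn)                 # the operation that produced the loss
print(loss.grad_fn.next_functions)  # the operations that fed into it

# Walk the recorded operations backwards and fill in a.grad.
loss.backward()
print(a.grad)
```

The loss carries a single value, but its `grad_fn` attribute links back to the operations that produced it, and that chain is what `loss.backward()` follows to reach `a`.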

Nonetheless this seems quite “magical”; I would be really grateful if somebody could shed some light on this! It would also be great to understand a little better how PyTorch computes gradients. Is it doing that analytically?
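On the last question, as far as I understand it PyTorch uses reverse-mode automatic differentiation: each individual operation has an exact derivative rule, and `backward()` chains those rules through the operations it recorded, so no finite-difference approximation is involved. A minimal check of this, assuming made-up data and a plain mean-squared-error `mse` (the name `analytic_grad` is also just illustrative), compares `a.grad` with the hand-derived gradient `(2/n) * x.T @ (x @ a - y)`:

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in data: 100 rows, one feature column plus a column of ones.
n = 100
x = torch.ones(n, 2)
x[:, 0].uniform_(-1.0, 1.0)
y = x @ torch.tensor([3.0, 2.0]) + torch.rand(n)

# Assumed definition of the loss: mean of squared differences.
def mse(y, y_hat):
    return ((y - y_hat) ** 2).mean()

a = torch.tensor([-1.0, 1.0], requires_grad=True)

# Gradient as computed by autograd.
loss = mse(y, x @ a)
loss.backward()

# Gradient derived by hand for this loss: (2/n) * x^T (x a - y).
with torch.no_grad():
    analytic_grad = (2.0 / n) * x.t() @ (x @ a - y)

print(a.grad)
print(analytic_grad)
print(torch.allclose(a.grad, analytic_grad, atol=1e-5))
```

If the two vectors agree up to floating-point error, the gradient from `loss.backward()` is the same one you would get by deriving the formula first and plugging in the values afterwards.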