Please let me know if my understanding is correct:
We set requires_grad=True on a because a is the parameter we want to optimize.
We call loss.backward() because it computes the gradient of the loss with respect to a, which is what we need to update a and minimize the loss.
Here is the snippet filled out so it runs end to end (the data, mse, and lr definitions are my assumptions; they were not part of the original excerpt):

import torch
# assumed setup (not in the original snippet): synthetic data, an mse helper, and a learning rate
x = torch.randn(100, 2)
y = x @ torch.tensor([3., 2.])
def mse(y, y_hat): return ((y - y_hat) ** 2).mean()
lr = 0.1
a = torch.tensor([-5., 5.], requires_grad=True)
for t in range(100):
    y_hat = x @ a
    loss = mse(y, y_hat)
    if t % 10 == 0: print(loss)  # print loss every 10 loops
    loss.backward()  # compute the derivatives: populates a.grad with dL/da
    with torch.no_grad():  # prevent tracking history (and using memory) during the update
        a.sub_(lr * a.grad)  # w(t) = w(t-1) - lr * dL/dw(t-1)
        a.grad.zero_()  # reset the gradient; .backward() accumulates otherwise
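
For what it's worth, that manual a.sub_(lr * a.grad) update is what plain torch.optim.SGD (no momentum) does for you. Here is a minimal sketch of the same loop using the optimizer, reusing the assumed x, y, and mse from above:

import torch
from torch import optim

a = torch.tensor([-5., 5.], requires_grad=True)
opt = optim.SGD([a], lr=0.1)  # plain SGD: step() applies p -= lr * p.grad
for t in range(100):
    loss = mse(y, x @ a)  # reuses the assumed x, y, and mse defined above
    loss.backward()       # populate a.grad
    opt.step()            # performs the update (internally without grad tracking)
    opt.zero_grad()       # replaces the manual a.grad.zero_()

The optimizer version is less error-prone: you can't forget the torch.no_grad() wrapper or the gradient reset.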