In the Lesson 2 notebook, we use the code below to update the coefficients:

    def update():
        y_hat = x@a
        loss = mse(y, y_hat)
        if t % 10 == 0: print(loss)
        loss.backward()
        with torch.no_grad():
            a.sub_(lr * a.grad)
            a.grad.zero_()

However, I have some questions about it.

It uses *a.grad* to get the gradient of *a* just once per call to update().

But, as far as I know, stochastic gradient descent computes the gradient from a single input sample, updates the parameters, and repeats this n times per epoch (where n is the number of samples). The code above looks like batch gradient descent to me, since it calculates the gradient over all samples in each epoch… (If I’m wrong, please correct me.)

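The calculation of the gradient happens in **loss.backward()**. *a.grad* just stores the gradient produced by that call. PyTorch adds each new gradient to whatever is already stored in a.grad rather than overwriting it, which is why you need to clear its ‘memory’ with a.grad.zero_() after every update.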
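To see the accumulation behaviour in isolation, here is a minimal sketch. The tensor values and the toy function sum(a**2) are made up for illustration and are not taken from the lesson notebook.

```python
import torch

# Toy parameter (illustrative values, not from the notebook).
a = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: the gradient of sum(a**2) with respect to a is 2*a.
(a ** 2).sum().backward()
grad_first = a.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added
# to the one already stored in a.grad.
(a ** 2).sum().backward()
grad_accumulated = a.grad.clone()

# Zero the stored gradient, then run backward again:
# a.grad now holds only the fresh gradient.
a.grad.zero_()
(a ** 2).sum().backward()
grad_after_zero = a.grad.clone()

print(grad_first, grad_accumulated, grad_after_zero)
```

The second printed tensor should be twice the first, and the third should match the first again, which is the reason update() ends with a.grad.zero_().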

Also, as you’ve said, the lesson-2-sgd code shows regular (full-batch) gradient descent: every update uses all of the samples. The stochastic aspect is explained mainly in the lesson video, and I think he elaborates a bit more in Lesson 5 when talking about optimization methods and implementing a single-layer neural net in Excel.
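For comparison, a mini-batch version of the same update could look roughly like the sketch below. This is not the notebook's code: the synthetic data, the batch size `bs`, the learning rate, the epoch count, and the helper name `update_minibatch` are all assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)

# Synthetic linear data: y = 3*x + 2 plus a little noise (assumed setup).
n = 100
x = torch.ones(n, 2)
x[:, 0].uniform_(-1., 1.)
true_a = torch.tensor([3., 2.])
y = x @ true_a + 0.1 * torch.randn(n)

def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()

a = torch.tensor([-1., 1.], requires_grad=True)
lr = 0.1   # assumed learning rate
bs = 10    # assumed batch size; bs = 1 is one sample per update, bs = n is full batch

def update_minibatch(idx):
    # Same steps as update(), but the loss only sees the rows selected by idx.
    y_hat = x[idx] @ a
    loss = mse(y_hat, y[idx])
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)
        a.grad.zero_()
    return loss.item()

for epoch in range(50):
    perm = torch.randperm(n)           # reshuffle the samples every epoch
    for i in range(0, n, bs):
        update_minibatch(perm[i:i + bs])

print(a)
```

With `bs = 1` this becomes the one-sample-at-a-time procedure described in the question, and with `bs = n` it reduces to the full-batch update from the notebook. The only thing that changes is how many rows feed each gradient.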