04 MNIST Basics - Why we set the grad of params to None

The walkthrough of MNIST using SGD shows the params being updated and then the grads of the respective params being reset. Why is that needed?

params.data -= lr * params.grad.data
params.grad = None
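To see where these two lines fit, here is a minimal sketch of a single SGD step (the tensor values, `lr`, and `loss_fn` are illustrative assumptions, not from the walkthrough):

```python
import torch

lr = 0.1
params = torch.tensor([1.0, -2.0], requires_grad=True)

def loss_fn(p):
    # Toy loss: sum of squares, so d(loss)/dp = 2 * p
    return (p ** 2).sum()

loss = loss_fn(params)
loss.backward()                 # populates params.grad

with torch.no_grad():
    params -= lr * params.grad  # same effect as params.data -= lr * params.grad.data
params.grad = None              # clear gradients before the next iteration
```

Using `with torch.no_grad():` is the more idiomatic modern way to perform the in-place update without it being tracked by autograd; mutating `.data` directly, as in the walkthrough, has the same effect here.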


In PyTorch, gradients are accumulated by default: each backward pass adds the new gradients to whatever is already stored in `.grad` rather than replacing it. This is the desired behaviour in some cases, such as training RNNs, where gradients from several passes need to be summed, but in other instances, including the MNIST walkthrough, it makes no sense to mix old gradients with new ones, since the old ones have already been used to update the model. Thus, the gradients need to be zeroed out or set to None after every iteration.
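The accumulation behaviour is easy to demonstrate with a toy tensor (the values here are made up for illustration):

```python
import torch

params = torch.tensor([1.0, 2.0], requires_grad=True)

loss = (params ** 2).sum()  # d(loss)/d(params) = 2 * params
loss.backward()
print(params.grad)          # tensor([2., 4.])

# Without clearing, a second backward pass ADDS to the stored grads.
loss = (params ** 2).sum()
loss.backward()
print(params.grad)          # tensor([4., 8.]) -- accumulated, not replaced

# Setting .grad to None gives fresh gradients on the next pass.
params.grad = None
loss = (params ** 2).sum()
loss.backward()
print(params.grad)          # tensor([2., 4.]) again
```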