Hi all, I haven’t found a discussion about this here yet and googling it has led to more confusion so I figured I’d go ahead and raise a new topic.

In Fastbook, Chapter 4, two different syntaxes are presented for stepping the weights (with and without .data) and subsequently zeroing the gradients (assigning None versus calling .zero_()).

Twice the syntax is shown like this:

Stochastic Gradient Descent - An End-to-End SGD Example - Step 5: Step the weights

params.data -= lr * params.grad.data
params.grad = None

And similarly in the BasicOptim class:

Putting It All Together: Creating an Optimizer

def step(self, *args, **kwargs):
    for p in self.params: p.data -= p.grad.data * self.lr

def zero_grad(self, *args, **kwargs):
    for p in self.params: p.grad = None
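For anyone who wants to poke at this, here is a self-contained sketch of the BasicOptim idea on a toy one-element parameter; the __init__ shown is my own addition for completeness, not quoted from the book:

```python
import torch

# Sketch of the BasicOptim idea on a toy parameter.
# The __init__ here is an assumption, not quoted from the book.
class BasicOptim:
    def __init__(self, params, lr):
        self.params, self.lr = list(params), lr

    def step(self, *args, **kwargs):
        for p in self.params: p.data -= p.grad.data * self.lr

    def zero_grad(self, *args, **kwargs):
        for p in self.params: p.grad = None

w = torch.tensor([2.0], requires_grad=True)
opt = BasicOptim([w], lr=0.1)
(w ** 2).sum().backward()   # d(w^2)/dw = 2w = 4
opt.step()                  # w <- 2.0 - 0.1 * 4.0, approximately 1.6
opt.zero_grad()             # the gradient is dropped entirely
print(w.item(), w.grad)
```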

However, in Putting It All Together, train_epoch uses this syntax:

for p in params:
    p.data -= p.grad*lr
    p.grad.zero_()

So ultimately my questions are: why does this second example use p.grad * lr rather than p.grad.data * lr like the previous examples, and is there a difference between p.grad = None and p.grad.zero_()?

When trying to Google this and looking through the fastai source code, I also came across with torch.no_grad(): and detach(), which added to the confusion. I’m pretty sure with torch.no_grad(): is just a context manager that removes the need for .data, but I could use clarification on that, plus more information on how detach() works and when to use it.
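To the best of my understanding the two update styles really are interchangeable; a small sketch (toy tensor, made-up lr) comparing them:

```python
import torch

# Comparing the two update styles from the chapter on a toy tensor:
# the .data trick and the torch.no_grad() context manager should leave
# the parameter with the same value.
def train_step(update, lr=0.5):
    w = torch.tensor([3.0], requires_grad=True)
    (w * 2).sum().backward()   # w.grad is tensor([2.])
    update(w, lr)
    return w.item()

def with_data(w, lr):
    w.data -= lr * w.grad.data   # old style: bypass autograd via .data

def with_no_grad(w, lr):
    with torch.no_grad():        # new style: pause gradient tracking
        w -= lr * w.grad

a, b = train_step(with_data), train_step(with_no_grad)
print(a, b)   # both 2.0
```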

As for p.grad.zero_() versus setting the gradient to None, there is definitely a difference at the Python/PyTorch level: zero_() fills the existing gradient tensor with zeros in place, while assigning None discards it. But I imagine they have the same effect on training. I think the next backward pass would recreate the gradients if they do not exist, but you would need to test to be sure (or study PyTorch’s autograd code).

After experimenting some with the Chapter 4 notebook, it seems that p.data -= p.grad.data * lr and p.data -= p.grad * lr are functionally equivalent, and that p.grad = None and p.grad.zero_() are also functionally equivalent.
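To make that second equivalence concrete, here is a minimal check on a toy tensor (not the notebook’s actual parameters): after either reset, the next backward() yields the same gradient instead of accumulating.

```python
import torch

# Checking the zero_()-versus-None equivalence: after either reset,
# the next backward() produces the same gradient rather than accumulating.
def grad_after_reset(reset):
    w = torch.tensor([1.0], requires_grad=True)
    (w * 3).sum().backward()     # w.grad = tensor([3.])
    reset(w)                     # clear it one way or the other
    (w * 3).sum().backward()     # without a reset this would accumulate to 6
    return w.grad.item()

g_none = grad_after_reset(lambda w: setattr(w, "grad", None))
g_zero = grad_after_reset(lambda w: w.grad.zero_())
print(g_none, g_zero)   # 3.0 3.0
```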

However, after reading through the discussion you linked it seems that using with torch.no_grad() is the preferred method now instead of using .data.

I’m still a little confused about where I would use detach(). I understand what it does in practice (returns a view of a tensor that is detached from the gradient computation), but I think its use cases may be beyond my knowledge right now.

@gandersen101
I would prefer not to use detach() here. Mostly because, to get a truly independent copy, you would have to clone the detached tensor, and as @Pomo mentioned, that is memory inefficient. Moreover, with torch.no_grad(): is a cleaner way to handle the weight update. The idea is basically not to track the changes during the weight-update step. We do want to track the operations on the parameters during forward propagation (multiplying by the weights and summing them), because those count towards the gradient calculation. But by default PyTorch would also try to record the weight update itself in the graph, which is undesirable if you think about it. with torch.no_grad(): makes sure that gradients are simply not calculated with respect to this step.
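A small illustration of that point (toy tensor; in fact PyTorch raises an error rather than silently tracking an in-place update on a leaf that requires grad):

```python
import torch

# Why the update must be shielded from autograd: an in-place update on a
# leaf tensor that requires grad is rejected outside of no_grad().
w = torch.tensor([1.0], requires_grad=True)
(w * 2).sum().backward()     # w.grad = tensor([2.])
blocked = False
try:
    w -= 0.1 * w.grad        # tracked in-place update on a leaf -> error
except RuntimeError:
    blocked = True
with torch.no_grad():
    w -= 0.1 * w.grad        # not tracked: allowed
print(blocked, w.item())     # blocked is True; w is now ~0.8
```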

Regarding detach(), the idea is similar: the detached tensor is cut off from the autograd graph, so updating the weights through it has no effect on the operation history of the original tensor. Note, though, that detach() by itself does not copy the data; it returns a new tensor sharing the same storage. Only detach().clone() makes a true copy, and that extra memory matters when we are dealing with millions of parameters.

You don’t need detach() for this course, at least; of course you can use it elsewhere. Think of it as a method to get a tensor with the same values that is otherwise unrelated to the original as far as autograd is concerned. If you are familiar with the concept of pointers: detach() gives you a new tensor object that points at the same underlying data, while detach().clone() gives you one with its own data.
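For reference, a quick check of what detach() actually returns: it shares storage with the original tensor, and .clone() is needed for a true copy.

```python
import torch

# detach() returns a new tensor sharing the same storage as the original,
# just cut off from the autograd graph; .detach().clone() is the real copy.
w = torch.tensor([1.0, 2.0], requires_grad=True)
d = w.detach()
print(d.requires_grad)               # False
print(d.data_ptr() == w.data_ptr())  # True  -> same underlying memory
c = w.detach().clone()
print(c.data_ptr() == w.data_ptr())  # False -> independent copy
```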