Efficient per-sample gradients

When we get the loss gradient in PyTorch, it actually sums the gradients of the individual samples in the batch. But there are situations where you want to access each sample's gradient, perform an operation on it, and only after that operation take the sum or the mean.
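
For example, with a toy model you can check that one backward pass over a batch leaves exactly the sum of the per-sample gradients in `.grad` (a minimal sketch, the model and data are just for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)
x, y = torch.randn(8, 4), torch.randint(0, 3, (8,))
loss_fn = nn.CrossEntropyLoss(reduction='sum')

# one backward pass over the whole batch: .grad holds the summed gradient
loss_fn(model(x), y).backward()
batch_grad = model.weight.grad.clone()

# eight single-sample backward passes accumulate to the same tensor
model.weight.grad = None
for i in range(len(x)):
    loss_fn(model(x[i:i+1]), y[i:i+1]).backward()

print(torch.allclose(batch_grad, model.weight.grad, atol=1e-6))  # True
```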

One such case is when you want to calculate the Fisher information matrix of the weights.
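
To make it concrete: for the empirical diagonal Fisher you have to square each sample's gradient before averaging, so the already-summed gradient is not enough. Roughly (the function name is just mine):

```python
import torch

def empirical_fisher_diag(per_sample_grads):
    # per_sample_grads: [batch, *param_shape], one log-likelihood gradient per sample
    # square first, then average over the batch
    return (per_sample_grads ** 2).mean(dim=0)
```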

I have been on the PyTorch forums and it seems PyTorch does not let you change that behaviour: you get a sum or a mean and that is it.

It is possible, though, to get these per-sample gradients, but you will need to calculate and keep the gradient of the loss w.r.t. the activations and also keep the hidden layer values (Goodfellow 2015).
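
For a linear layer the trick boils down to an outer product per sample. Something like this, assuming `A` holds the saved layer inputs and `G` the saved gradients of the loss w.r.t. the layer's outputs (my own names, not the paper's):

```python
import torch

def per_sample_linear_grads(A, G):
    # A: layer inputs,                   [batch, in_features]
    # G: dLoss/d(layer outputs),         [batch, out_features]
    grad_weight = torch.einsum('bo,bi->boi', G, A)  # [batch, out, in]
    grad_bias = G                                    # [batch, out]
    return grad_weight, grad_bias
```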

I was wondering how I could do that with callbacks. My main concern is how to get this gradient w.r.t. the activations, which seems like something you would add in the forward pass. But there is no fastai forward callback, is there?
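
I could probably get around that with plain PyTorch module hooks, which work regardless of the fastai callback system: a forward hook keeps the layer input, a backward hook keeps the gradient w.r.t. the layer output. A rough sketch (`register_full_backward_hook` is the newer name; older PyTorch versions use `register_backward_hook`):

```python
import torch
import torch.nn as nn

saved = {}

def fwd_hook(module, inputs, output):
    saved['A'] = inputs[0].detach()        # layer input, [batch, in_features]

def bwd_hook(module, grad_input, grad_output):
    saved['G'] = grad_output[0].detach()   # dLoss/d(layer output), [batch, out_features]

layer = nn.Linear(4, 3)
handles = [layer.register_forward_hook(fwd_hook),
           layer.register_full_backward_hook(bwd_hook)]

# requires_grad on the input just to make sure the backward hook fires
x = torch.randn(8, 4, requires_grad=True)
layer(x).sum().backward()

# per-sample weight gradients via the outer product from Goodfellow (2015)
per_sample_w = torch.einsum('bo,bi->boi', saved['G'], saved['A'])
print(per_sample_w.shape)  # torch.Size([8, 3, 4])

for h in handles:
    h.remove()
```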

Any suggestions?


You should take a look at this paper: https://arxiv.org/abs/1909.01440
It doesn't exactly answer your question, but I believe you'll get some good insights from it.
The paper's GitHub repository is also public.

Thanks for the suggestion.

I actually know how to calculate the per-sample gradients, instead of just their sum, using https://arxiv.org/abs/1510.01799.

What I don't know is how to compute the gradient w.r.t. the activations during the forward pass. I was looking into the use of hooks in fastai and decided to watch lessons 10 and 12 of v3 (2019).

Now I think I may not need to use this trick, because in the MixUp callback the loss is computed per sample, which is what I need. I will try that. It is interesting that in the PyTorch forums there is a lot of conversation on this subject and nobody mentioned that you can change the loss reduction to `'none'`, which I expect will return the per-sample losses in a tensor, not summed up, which is what I want.
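
For reference, this is what I mean (toy model, my own names): `reduction='none'` returns one loss per sample, and since `backward()` still needs a scalar, getting one gradient per sample means either backpropagating each of those losses separately, as below, or using the activation trick from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(4, 3)
x, y = torch.randn(8, 4), torch.randint(0, 3, (8,))

losses = F.cross_entropy(model(x), y, reduction='none')  # shape [8], one loss per sample

# one (slow but simple) way to get a gradient per sample: backprop each loss
per_sample_grads = []
for i in range(len(losses)):
    grads = torch.autograd.grad(losses[i], model.parameters(), retain_graph=True)
    per_sample_grads.append(grads)

print(len(per_sample_grads), per_sample_grads[0][0].shape)  # 8 torch.Size([3, 4])
```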