[Need Help] Loss gradients by sample

I need to access the loss gradients per sample, before they are reduced with ‘sum’ or ‘mean’, during training (once every 20 epochs). I don’t want to run all my training with batch size = 1.

I was trying to do this with callbacks. But now I think I may need a Learner Wrapper so that I can calculate the Loss without reduction, copy the gradients, and then apply reduction afterwards.

Am I overthinking this? Is there a simpler way that I am not seeing?

If you look at get_preds you can see how fastai does this for interpretation (i.e. not during training). It uses the NoneReduceOnCPU context manager, which basically just creates a version of the base loss function with reduction='none' (reduction is a standard parameter in PyTorch loss functions, and the no-reduction value is the string 'none').
If you want to do this in training you’ll need to be careful not to keep references to the losses around: the unreduced loss carries the gradient history for the whole batch, so it will hold a lot of CUDA memory until you either delete or detach it.
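As a rough illustration of the two points above, here is a minimal PyTorch sketch outside fastai. The toy nn.Linear model, the shapes, and the variable names are assumptions made for the example, not anything from fastai itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(8, 3)            # toy stand-in for a real model
xb = torch.randn(5, 8)             # batch of 5 items
yb = torch.randint(0, 3, (5,))     # class targets

pred = model(xb)
# One loss value per item, shape (5,), still attached to the autograd graph
per_item = F.cross_entropy(pred, yb, reduction='none')
loss = per_item.mean()             # the scalar that training would normally use

# Keep only a detached copy for inspection so the batch's graph can be freed
per_item_detached = per_item.detach().cpu()
loss.backward()
```

Only the detached copy should be stored across iterations; holding on to per_item itself would keep the whole batch's graph alive.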

I’d try something like this: in the on_train_begin callback, substitute learn.loss_func with a non-reducing version (i.e. create the base loss function with reduction='none'); then in Callback.on_backward_begin replace the last_loss in Learner with a reduced version that training should use, while doing what you want with the unreduced version (after detaching, most likely). There’s probably an existing callback in fastai that modifies the loss, which you can use as a guide for how to get your computed reduced version back to the training loop.


Thanks for the reply.
As I see it (correct me if I’m wrong), get_preds returns the loss per sample, but not the loss gradient of each sample.

I tried to do something like this with callbacks before, using as example the NoneReduce from Lesson 12, 2019 (10b_mixup_label_smoothing.ipynb).

The problem is that by on_backward_begin the fit method has already called loss_batch, and I would need to be able to:

  1. access the loss with its gradients
  2. sum the gradients to be used in the rest of backwards

I don’t know how to do either (1) or (2). :frowning:
I expected the Learner to have access to the loss gradient, but I can’t find out how.
This makes me think that maybe I need an OptimWrapper, similar to what is done for accumulating gradients, but I am not sure.

Sorry, yes, that would give you the loss per sample not the loss gradients. Not sure quite what you mean by loss gradients per sample. I don’t think these are ever created by either fastai or pytorch. The loss per sample is calculated then reduced to a single scalar loss, then that single loss is back-propagated through the model. There is no per-sample gradient calculation going on.
You’d have to run a separate backward pass for the unreduced loss of each sample which would likely be very inefficient if it’s actually possible. Is that the sort of thing you want? I think you’d have to write your own training loop for that as calling backward is handled outside of callbacks and happens only once per batch.

Oh, you may want to look at https://github.com/uber-research/loss-change-allocation/ - it looks at analysing the loss so as to allocate it to e.g. particular layers. Not sure if they do allocation to inputs but if not their method might be extendable to it. I haven’t looked into the technique much so can’t speak to implementation.

Exactly. But I need the single loss. I want to calculate the Fisher Matrix of the loss w.r.t. the weights, \mathbb{E} [(\nabla_w \log p)^2], which is different from (\mathbb{E} [\nabla_w \log p])^2. To get the right number I need to square the gradient of each sample and sum (or average) afterwards.
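A two-sample toy case makes the difference concrete. The symbols g_i (per-sample gradient) and B (batch size) are notation introduced here for illustration, and the square is taken element-wise, so this is the diagonal form of the quantity described above.

```latex
\[
g_i = \nabla_w \log p(y_i \mid x_i; w), \qquad
\frac{1}{B}\sum_{i=1}^{B} g_i^{2}
\;\neq\;
\Bigl(\frac{1}{B}\sum_{i=1}^{B} g_i\Bigr)^{2}
\quad\text{in general.}
\]
\[
\text{For } B = 2,\ g_1 = 1,\ g_2 = -1:\qquad
\frac{1}{2}\bigl(1^{2} + (-1)^{2}\bigr) = 1
\quad\text{but}\quad
\Bigl(\frac{1}{2}\,(1 - 1)\Bigr)^{2} = 0 .
\]
```

The usual backward pass only yields the averaged gradient, i.e. the quantity inside the square on the right, which is why per-sample gradients are needed for the left-hand side.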

My thought was that if you set the reduction to none, the backward pass would give you a tensor, not a scalar. And then I would need to transform this tensor into a scalar in the on_backward_begin callback.

Does it make sense?

To be clear, by sample you mean a single item of your batch? The unreduced loss should be a 1D tensor of length batch_size. By default fastai/pytorch will compute the mean (or optionally sum) of that and use that as the loss.

Exactly! I need the gradient of the loss for each item of my batch.

If I had a batch_size of 1, I would have what I want, but I don’t want to calculate it sample by sample. I want to do the operation on a batch and get the resulting tensor before it is reduced by a sum. I guess this is what reduction='none' means, right?

I could:

from fastai.basic_train import Learner, LearnerCallback  # fastai v1

class Fisher(LearnerCallback):
    def __init__(self, learn:Learner):
        super().__init__(learn)

    def fisher_loss_func(self, pred, yb):
        # NoneReduce from 10b_mixup_label_smoothing.ipynb: temporarily switches
        # the wrapped loss to reduction='none' so it returns one loss per item
        with NoneReduce(self.old_loss_func) as loss_func:
            return loss_func(pred, yb)

    def on_train_begin(self, **kwargs):
        # Swap in the unreduced loss, keeping a reference to the original
        self.old_loss_func, self.learn.loss_func = self.learn.loss_func, self.fisher_loss_func

    def on_backward_begin(self, **kwargs):

...

and then in on_backward_begin I would have to access the 1D tensor of length batch_size with the loss gradient for each item in the batch, copy it and then sum it into a scalar, so that the rest of the backward pass can work as always. I just don’t know how to access this loss gradient tensor in on_backward_begin.

The loss gradient doesn’t exist in on_backward_begin. There is the loss, typically reduced, but if you override loss_func as you show then you can avoid that. But there are no gradients yet in on_backward_begin. After on_backward_begin the loss is passed back through the network, calculating gradients. So the last module takes the loss and computes the gradient of the loss with respect to its inputs and, if it has weights, the gradient w.r.t. those. Then the gradient w.r.t. the inputs, which are of course the outputs of the previous module, is passed to that module along with its outputs (which are stored along with any intermediates needed in the forward pass) to compute the gradient w.r.t. the second-last module’s inputs. And so on back through the network. This is all done from a single scalar loss (and PyTorch will raise an error if it isn’t a scalar, unless you pass an explicit gradient argument to backward). This is at least the basic idea; I may have mixed up my w.r.t. or something, though.
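The scalar requirement is easy to check in isolation. The sketch below uses a made-up three-weight "model" (x @ w) purely for illustration. It shows that calling backward on an unreduced loss raises an error, and that supplying an upstream gradient of ones just reproduces the gradient of the summed loss rather than giving one gradient per item.

```python
import torch

w = torch.ones(3, requires_grad=True)
x = torch.tensor([[1., 2., 3.], [4., 5., 6.]])   # 2 items, 3 features
per_item = x @ w                                  # unreduced "loss", shape (2,)

try:
    per_item.backward()                           # no gradient argument given
    raised = False
except RuntimeError:
    raised = True                                 # autograd will not implicitly create a grad for a non-scalar

# Supplying the upstream gradient works, but the result is still summed over items
per_item.backward(gradient=torch.ones_like(per_item))
grad_from_vector = w.grad.clone()

w.grad = None
(x @ w).sum().backward()
grad_from_sum = w.grad.clone()                    # identical to grad_from_vector
```

So setting the reduction to none does not by itself turn the weight gradients into a per-item tensor; the sum over items happens inside the backward pass.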

So I don’t think there ever is a per-item gradient. You’d have to do something like run each item’s loss through backward separately. Though then I think you’d have to accumulate the gradients before the optimizer step, as I don’t think you’d want to apply per-item gradients directly (you’d run into the same issues as with very small batch sizes, which is why gradients are accumulated in that case).
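A minimal sketch of that "one backward per item" idea in plain PyTorch, assuming a model where each item's loss depends only on that item (no BatchNorm, for instance). The helper name per_item_grads and the toy nn.Linear model are hypothetical, and the loop costs one backward pass per item, so it is only reasonable for occasional use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def per_item_grads(model, xb, yb):
    """Return a (batch_size, n_params) tensor: one flattened gradient per item.

    Hypothetical helper for illustration; it runs one backward pass per item.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    losses = F.cross_entropy(model(xb), yb, reduction='none')   # shape (batch_size,)
    rows = []
    for i in range(losses.shape[0]):
        # retain_graph so the shared forward graph survives until the last item
        grads = torch.autograd.grad(losses[i], params, retain_graph=True)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    return torch.stack(rows)

torch.manual_seed(0)
model = nn.Linear(8, 3)                    # toy stand-in: 8*3 weights + 3 biases = 27 params
xb, yb = torch.randn(5, 8), torch.randint(0, 3, (5,))

g = per_item_grads(model, xb, yb)          # shape (5, 27)
mean_of_squares = (g ** 2).mean(dim=0)     # mean over items of the squared per-item gradients
square_of_mean = g.mean(dim=0) ** 2        # what squaring the usual batch gradient would give
```

Averaging the rows of g recovers the ordinary mean-reduced batch gradient, so the normal optimizer step can still be driven from the same per-item gradients.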

Oh, and you can play with the gradient calculation using torch.autograd.grad (use retain_graph=True or repeated calculations will fail). Something like this (adapted from code I’m working on and not tested after adaptation):

import torch
import torch.nn as nn

m = nn.Linear(10, 4)  # stand-in for any module (conv2d etc.)
# Forward pass
inp = torch.rand(10, requires_grad=True)
y = m(inp)  # Could also call a functional like F.relu here
l = y.mean()  # Reduced loss
# Backward pass
grad_y, = torch.autograd.grad(l, y, retain_graph=True)  # Gradient at the output of m
grad_inp, = torch.autograd.grad(y, inp, grad_y, retain_graph=True)  # Gradient at the input of m
# torch.autograd.grad returns the gradients and does not fill .grad; to get the
# gradient w.r.t. the weights of m, pass them as inputs, e.g.
# torch.autograd.grad(l, list(m.parameters()), retain_graph=True)

Oh. Ok. I got it wrong.

That is my problem. Now it is at least clearer.

Maybe have a look at https://towardsdatascience.com/its-only-natural-an-excessively-deep-dive-into-natural-gradient-optimization-75d464b89dbb - it goes into the Fisher Matrix in the context of DL. I don’t fully understand it and mainly skimmed, but a key thing seems to be that you are not dealing with the gradient w.r.t. the loss as the backward pass does, but are instead talking about the “gradient of the log likelihood of the model” (emphasis added). That article at least looks to be talking about using it to optimise the gradient descent process, not as a loss function. It looks more related to a gradient of the gradients w.r.t. the loss (the second derivative of the loss). But I could be off.

Cross entropy loss is \mathbb{E} [- \log p]. So its per-sample gradient is - \nabla \log p, and as we are going to square it, it is the same as the square of the gradient of the log likelihood of the model.

If you just want to apply this to the outputs of the model, as you would cross entropy, then you want to provide a custom loss function and don’t need a callback. Your custom loss function will have access to the per-item outputs of the model. The outputs will have a shape of something like (B,C,…), Batch x Class, where … will depend on input size and model; in image classification with a typical model I think this should be (B,C,1), or it might have been squeezed to just (B,C). The targets should be of shape (B,). Then fastai will take care of calculating the gradient of whatever your loss function returns w.r.t. the weights (to access these gradients you’d use a backward hook callback).
Otherwise I’m still not clear on what you’re trying to do.

Just do a wrapper around your loss, it works perfectly fine. I did it like this:

from dataclasses import dataclass

import torch
import torch.nn as nn

@dataclass
class URLoss():
    func: nn.Module
    loss: torch.Tensor = None
    reduction: str = 'mean'
    do_spatial_reduc: bool = False
    axis: int = 1

    def __post_init__(self):
        self.func.reduction = 'none'

    def __call__(self, input, target):

        self.func.reduction = 'none'
        target = target.squeeze(1)
        self.loss = self.func(input, target)
        if self.do_spatial_reduc:
            self.loss = self.loss.view(self.loss.size(0), -1)
            self.loss = self.loss.mean(-1)
        if self.reduction == 'mean':
            return self.loss.mean()
        elif self.reduction == 'sum':
            return self.loss.sum()
        else:
            return self.loss 

You need to import dataclasses.dataclass for this. The do_spatial_reduc attribute has to be True for cross-entropy on spatial outputs, as it yields a totally unreduced loss (even over spatial dimensions) when reduction='none'. As I created custom losses that always reduce over spatial dimensions, I added this argument. Then you can probably access per-item gradients with self.loss.grad in on_backward_end, as it should still be in the computation graph when you call backward on the reduced loss (self.loss is a non-leaf tensor, so you would need to call self.loss.retain_grad() before the backward pass for .grad to be populated). I did not test it though, so can’t be sure.

I am deeply sorry I did not understand.

It seems you are flattening the loss, but still reducing it in the end with mean or sum. So,
how do I get the per-sample gradient?

I think I get it now; I will try it here.

I am only flattening over the spatial dimensions, so that the loss has shape (batch_size, h*w), and then once the mean over the last dimension is computed it just has shape (batch_size,). If by per-item you mean for each pixel, you can just skip the part with do_spatial_reduc. Same if you use a loss that already returns something of shape (batch_size,) when reduction='none' (for instance dice loss will always be reduced over each image).
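The shapes described above can be checked with a small standalone sketch. The segmentation-style sizes (B, C, H, W) are assumptions for the example, and F.cross_entropy stands in for whatever base loss is wrapped.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C, H, W = 4, 3, 6, 6                          # assumed segmentation-style shapes
logits = torch.randn(B, C, H, W)
target = torch.randint(0, C, (B, H, W))

# With reduction='none', cross entropy keeps the spatial dimensions
pixel_loss = F.cross_entropy(logits, target, reduction='none')   # shape (B, H, W)

# Flatten the spatial dimensions and average them to get one value per image
per_image = pixel_loss.view(B, -1).mean(-1)                      # shape (B,)
```

Because every image here has the same number of pixels, per_image.mean() matches the default mean-reduced loss, so training on the reduced value is unchanged.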

I will try to do this and share the notebook.


@fredguth would you be so kind as to share your findings?