Accumulating Gradients

A little off-topic, but I wonder if I can use the idea of a custom OptWrapper to get the per-sample loss gradient w.r.t. the weights (basically remove reduction=sum) and then turn the reduction back on in on_backward_begin. I was trying to do it all with callbacks, but I don't know how to get the loss gradient in on_backward_begin and propagate the change to the rest of the gradient calculations.
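For what it's worth, in plain PyTorch you can get per-sample losses with reduction='none' and then sum them yourself before calling backward, which gives the same gradients as reduction='sum'. A minimal sketch (the model and data here are just placeholders, not from any fastai pipeline):

```python
import torch
import torch.nn as nn

# Toy setup, purely for illustration
model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss(reduction='none')  # keep one loss per sample

x = torch.randn(4, 10)
y = torch.randint(0, 2, (4,))

per_sample_losses = loss_fn(model(x), y)  # shape: (batch_size,)
total_loss = per_sample_losses.sum()      # "turn the reduction back on"
total_loss.backward()                     # same gradients as reduction='sum'
```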


I replaced all the BN layers of a Resnet50 Unet with your AccumulateBatchNorm module (63 replacements), and it exhausts the CUDA memory.

Do you know if there is any memory leak?

With a batch size of 1, it can't even start a single epoch.

I’ve read the code twice, and I can’t understand why.

I've tried replacing them with GroupNorm instead, and there is no memory leak at all.


I haven’t tested it so it’s very possible there is some bug/memory leak.

Does it even finish one iteration? If not, I doubt it is a memory leak; it probably just stores too much information in memory at once for some reason (maybe the computation graph gets quite complicated with this method).

Hi @sgugger,
You said that you removed all batchnorm from the Unet, but I still see batchnorm in the DynamicUnet() class in the fastai source code.

Yes, but the default in unet_learner is norm_type=None as seen here.


Getting really high train/validation loss with AccumulateScheduler … is this normal because of the “sum” reduction?

Train Loss=1044.117188
Validation Loss=930.394165


The issue is that I'm attempting to use this in a language model classification problem, where a probability is calculated for each actual token. I tried the code below, but it's still reporting the large losses I mentioned above.

class AbSumAccumulateScheduler(AccumulateScheduler):
    def on_batch_begin(self, last_input, last_target, **kwargs):
        "Accumulate samples and batches"
        self.acc_samples += last_input[0].shape[0]
        self.acc_batches += 1

    def on_backward_begin(self, last_target, last_loss, **kwargs):
        # Normalize the summed loss by the number of predicted (non-padding) tokens
        n_predicted_tokens = len(last_target[last_target != -1])
        last_loss = last_loss / n_predicted_tokens
        return {'last_loss': last_loss}

Are the loss results in the expected range if you divide them by the batch size (2x batch size for valid, if you use the standard fastai setup)?
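As a quick standalone sanity check of that scaling (a sketch of mine, outside the fastai pipeline): with reduction='sum' the reported loss is simply the batch size times the usual mean loss, so dividing by the batch size should land back in the familiar range.

```python
import torch
import torch.nn as nn

# With 'sum' reduction the loss scales with the batch size;
# dividing by batch_size recovers the usual 'mean' loss.
batch_size = 16
logits = torch.randn(batch_size, 10)
targets = torch.randint(0, 10, (batch_size,))

sum_loss = nn.CrossEntropyLoss(reduction='sum')(logits, targets)
mean_loss = nn.CrossEntropyLoss(reduction='mean')(logits, targets)
```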

@kcturgutlu, thanks for putting your time into this and sharing it with the community! Thanks @hwasiti and @sgugger for testing and supporting the development. I'm generally really amazed by the impact fastai has had on me, so kudos for your involvement!

I am still new to Python, PyTorch and fastai, but I have tested the scheduler callback successfully.

I just wanted to clarify:
If I end up changing loss functions (i.e. not just using cross-entropy with reduction=sum), is the main requirement that I sum the loss over each batch rather than averaging it? As part of my adventure into fastai, I will work on implementing some different loss functions for segmentation, and I'm just worried accumulation will add more complexity to this.
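For concreteness, here is a sketch of what a segmentation loss compatible with accumulation might look like (the soft Dice formulation is just an example of mine, not from this thread): the key point is to sum over the batch rather than average.

```python
import torch

# Hypothetical example: a soft Dice loss that SUMS over the batch
# (instead of averaging), so accumulated gradients match one big batch.
def summed_dice_loss(logits, targets, eps=1e-6):
    probs = torch.sigmoid(logits)
    dims = tuple(range(1, logits.dim()))       # reduce over all but the batch dim
    intersection = (probs * targets).sum(dims)
    union = probs.sum(dims) + targets.sum(dims)
    per_sample = 1 - (2 * intersection + eps) / (union + eps)
    return per_sample.sum()                    # 'sum' reduction over the batch

# Dummy predictions and masks, just to show the call
loss = summed_dice_loss(torch.zeros(3, 1, 4, 4), torch.ones(3, 1, 4, 4))
```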

Will this way of accumulating gradients interact badly with mixup? (I understand it may not be simple to answer.)

Thanks, and happy new year to you and this thread!

I believe there is a good solution for fastai 1.

Could you help clarify:

  1. Why do we need to average the weight gradients? Is it necessary?

  2. In many implementations floating around, I see people just call step with no averaging.

  3. Should we also average the loss, and will that have any impact on the gradients?

     model.zero_grad()                                   # Reset gradients tensors
     for i, (inputs, labels) in enumerate(training_set):
         predictions = model(inputs)                     # Forward pass
         loss = loss_function(predictions, labels)       # Compute loss function
         loss = loss / accumulation_steps                # Normalize our loss (if averaged)
         loss.backward()                                 # Backward pass
         if (i + 1) % accumulation_steps == 0:           # Wait for several backward steps
             optimizer.step()                            # Now we can do an optimizer step
             model.zero_grad()                           # Reset gradients tensors
             if (i + 1) % evaluation_steps == 0:         # Evaluate the model when we...
                 evaluate_model(model)                   # ...have no gradients accumulated
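On question 3, a quick standalone check (my own sketch, with a toy linear model) shows that dividing each micro-batch loss by accumulation_steps makes the accumulated gradient equal the gradient of the mean loss over the full batch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
x = torch.randn(4, 3)
y = torch.randn(4, 1)
loss_fn = nn.MSELoss()  # 'mean' reduction

# Gradient from the full batch at once
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulated over 2 micro-batches of 2, each loss divided by accumulation_steps=2
model.zero_grad()
for xb, yb in ((x[:2], y[:2]), (x[2:], y[2:])):
    (loss_fn(model(xb), yb) / 2).backward()
acc_grad = model.weight.grad.clone()
```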

I don't think that's right, is it? Momentum impacts the running stats, not the bn parameters; the bn parameters are already affected by gradient accumulation, afaict.
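A quick way to convince yourself of that split (my own sketch): BatchNorm's momentum only feeds the running statistics, while the affine weight/bias receive ordinary gradients like any other parameters.

```python
import torch
import torch.nn as nn

# `momentum` only controls the running-stats moving average;
# the learnable affine parameters are trained by plain gradients.
bn = nn.BatchNorm1d(4, momentum=0.1)
x = torch.randn(8, 4)

out = bn(x)           # train-mode forward updates running_mean/var via momentum
out.sum().backward()  # weight/bias get ordinary grads, untouched by momentum
```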