Accumulating Gradients

So sorry for that, it’s 100% my fault :frowning: . I was hoping to get stable results and to fix the batchnorm issue. I should probably create a doc for it explaining what this callback is for and what its limitations are.

1 Like

What do you think is the simplest way to fix the batchnorm issue in a way that doesn’t change too many parts of fastai?

Probably using instance norm or group norm, but in experiments it didn’t work for every dataset. For example, it worked in the case of MNIST but not for dog breeds.

For BatchNorm, I realize what was wrong with my layer now that we have had to debug a vanilla implementation with Jeremy. I don’t think it’s fixable without using the modified version Jeremy will introduce in the next course, but I’ll think about some way to do it (basically the update in training mode is done with the statistics of the batch, not the moving average; the moving average is only used at validation, so we need to find a way to trick BatchNorm into using the stats of the accumulated batches).
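To make that concrete, here is a toy illustration (mine, not the callback code) of the two behaviours:

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
x = torch.randn(8, 3, 4, 4)

bn.train()
y_train = bn(x)         # normalized with the stats of this batch; running_mean/var only get updated
bn.eval()
y_eval = bn(x)          # normalized with the accumulated running_mean/running_var

print(bn.running_mean)  # the running stats only matter at validation/inference time

So even when we accumulate gradients over several small batches, BatchNorm still computes its statistics on each small batch separately, which is where the problem comes from.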

5 Likes

I understand your comment a bit better after course 10. It may be useful to have a variant that mixes both solutions (running batchnorm and accumulating batchnorm). You might already have it worked out; I will have to sleep on it.

When I watched lesson 10, I said to myself, YES… For proper statistics we should accumulate sums and sums of squares, not the moving average… How clear it is now… This is why studying the basics in part 2 of the course is so important… I didn’t quite understand BN, so I didn’t know what the problem was with the running BN moving average…

I think my source of confusion is that during validation BN works with the moving average, right?
But couldn’t we do validation too with Jeremy’s modified method from training mode (using the stats of the accumulated batches)? And why is validation different from training?
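To check my own understanding, here is roughly what I picture (a rough, untested sketch of the lesson-10 idea, not Jeremy’s actual RunningBatchNorm class):

import torch
import torch.nn as nn

class AccumulatingBatchNorm2d(nn.Module):
    "Normalize with statistics accumulated over all the batches seen so far."
    def __init__(self, nf, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(nf, 1, 1))
        self.bias   = nn.Parameter(torch.zeros(nf, 1, 1))
        self.register_buffer('sums',  torch.zeros(1, nf, 1, 1))
        self.register_buffer('sqrs',  torch.zeros(1, nf, 1, 1))
        self.register_buffer('count', torch.tensor(0.))

    def forward(self, x):
        if self.training:  # accumulate sums and sums of squares on training batches
            with torch.no_grad():
                self.sums  += x.sum((0, 2, 3), keepdim=True)
                self.sqrs  += (x * x).sum((0, 2, 3), keepdim=True)
                self.count += x.numel() / x.shape[1]
        mean = self.sums / self.count
        var  = self.sqrs / self.count - mean * mean
        # the same accumulated stats are used in training and in validation
        return (x - mean) / (var + self.eps).sqrt() * self.weight + self.bias

If validation reuses the same accumulated stats (as in this sketch), then I guess there is no separate moving average at all, which is exactly the part that confuses me.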

Thanks Sylvain for your exceptional efforts and thanks to Jeremy… You haven’t let us down… Jeremy replied 1 month ago that

But here is a great developer who couldn’t let it go until he solved it with you Sylvain… :sparkling_heart:

Now I am trying to use Jeremy’s RunningBatchNorm class with my pets notebook experiments… I have created two classes, RunningBatchNorm2d and RunningBatchNorm1d, to replace all BN types in resnet18…
I have tried it for a couple of hours and modified the resnet18 to include these classes instead of BN…
Here is my notebook in nbviewer

I am getting this error when trying to run fit or get learn.summary()… I will try more to debug it and let you know how things go… Just in case anybody has an idea, please let me know… I think this is something related to the different dimension arrangements between our modified class and the normal BN…

RuntimeError                              Traceback (most recent call last)
<ipython-input-58-bc39e9e85f86> in <module>
----> 1 learn.summary()

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callbacks/hooks.py in model_summary(m, n)
    164 def model_summary(m:Learner, n:int=70):
    165     "Print a summary of `m` using a output text width of `n` chars"
--> 166     info = layers_info(m)
    167     header = ["Layer (type)", "Output Shape", "Param #", "Trainable"]
    168     res = "=" * n + "\n"

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callbacks/hooks.py in layers_info(m)
    158     func = lambda m:list(map(get_layer_name, flatten_model(m)))
    159     layers_names = func(m.model) if isinstance(m, Learner) else func(m)
--> 160     layers_sizes, layers_params, layers_trainable = params_size(m)
    161     layer_info = namedtuple('Layer_Information', ['Layer', 'OutputSize', 'Params', 'Trainable'])
    162     return list(map(layer_info, layers_names, layers_sizes, layers_params, layers_trainable))

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callbacks/hooks.py in params_size(m, size)
    146     with hook_outputs(flatten_model(m)) as hook_o:
    147         with hook_params(flatten_model(m))as hook_p:
--> 148             x = m.eval()(*x) if is_listy(x) else m.eval()(x)
    149             output_size = [((o.stored.shape[1:]) if o.stored is not None else None) for o in hook_o]
    150             params = [(o.stored if o.stored is not None else (None,None)) for o in hook_p]

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     90     def forward(self, input):
     91         for module in self._modules.values():
---> 92             input = module(input)
     93         return input
     94 

[... the same torch/nn/modules/module.py __call__ and container.py forward frames repeat for each nested Sequential block ...]

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/conv.py in forward(self, input)
    318     def forward(self, input):
    319         return F.conv2d(input, self.weight, self.bias, self.stride,
--> 320                         self.padding, self.dilation, self.groups)
    321 
    322 

RuntimeError: Given groups=1, weight of size [128, 64, 1, 1], expected input[1, 128, 28, 28] to have 64 channels, but got 128 channels instead
2 Likes

Did you try to run a dummy tensor with the right shape through your adapted BN layer?

Something like this but with the BN layer:

import torch
import torch.nn as nn

conv_layer = nn.Conv2d(1, 16, 3)
x = torch.randn(10, 1, 28, 28)  # Conv2d expects 4D input: (batch, channels, height, width)
conv_layer(x).shape
# this also works without binding the layer to a name: nn.Conv2d(1, 16, 3)(x).shape

The problem with errors in models that don’t have a “line-by-line” forward method is that the stack trace is quite confusing…

Another way to debug this is to insert a debugger layer (although the option from above should be easier).
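For example, a tiny “debugger layer” (my own throwaway module, not a fastai class) can be dropped into an nn.Sequential right before the layer that fails, so you can inspect the tensor shape at that point:

import torch.nn as nn

class Debugger(nn.Module):
    "Pass-through layer that drops into pdb so you can inspect shapes."
    def forward(self, x):
        import pdb; pdb.set_trace()  # check x.shape here, then `c` to continue
        return x

# e.g. model = nn.Sequential(conv1, Debugger(), conv2)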

Running BatchNorm is not a good idea, as I understand from the Batch Renormalization paper. I also tried it in a recent Kaggle competition and it behaved exactly the way they described in the paper. Please let me know if anyone got it to work, though :slight_smile:


3 Likes

A little off-topic, but I wonder if I can use the idea of a custom OptimWrapper to get per-sample loss gradients w.r.t. the weights (basically remove reduction=sum) and then turn the reduction back on in on_backward_begin. I was trying to do it all with callbacks, but I don’t know how to get the loss gradient in on_backward_begin and pass the change on to the rest of the gradient calculation.
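Something along these lines is what I mean (a completely untested sketch; PerSampleLossCallback is just a name I made up): compute the loss with reduction='none' so it stays per-sample, then reduce it again in on_backward_begin so backward() still gets a scalar.

import torch.nn as nn
from fastai.basic_train import LearnerCallback

class PerSampleLossCallback(LearnerCallback):
    def on_backward_begin(self, last_loss, **kwargs):
        # last_loss is assumed to be the unreduced, per-sample loss tensor here
        self.per_sample = last_loss.detach().clone()  # keep a copy for inspection
        return {'last_loss': last_loss.sum()}         # reduce before the backward pass

# learn = Learner(data, model, loss_func=nn.CrossEntropyLoss(reduction='none'),
#                 callback_fns=[PerSampleLossCallback])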

@sgugger

I replaced all the BN layers of a ResNet50 U-Net with your AccumulateBatchNorm module (63 replacements), and it exhausts the CUDA memory.

Do you know if there is any memory leak?

Using a batch_size of 1, it can’t start a single epoch.

I’ve read the code twice, and I can’t understand why.

I’ve tried replacing them with GroupNorm instead and there is no memory leak at all.
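For reference, this is roughly how I swap the modules (a quick helper I wrote, not a fastai function), so the only difference between the runs is the norm layer:

import torch.nn as nn

def replace_bn(module, num_groups=32):
    "Recursively replace every nn.BatchNorm2d with nn.GroupNorm (or any other norm layer)."
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            # num_groups must divide the number of channels
            setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
        else:
            replace_bn(child, num_groups)
    return module

# replace_bn(learn.model); then compare torch.cuda.max_memory_allocated() between variants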

TIA

I haven’t tested it so it’s very possible there is some bug/memory leak.

Does it even finish one iteration? If it doesn’t, I doubt it’s a memory leak; it probably just stores too much information in memory at once for some reason (maybe the computation graph gets quite complicated with this method).

Hi @sgugger,
You said that you have removed all batchnorm from the U-Net, but I still see batchnorm in the DynamicUnet() class in the fastai source code.

Yes, but the default in unet_learner is norm_type=None as seen here.

1 Like

Getting really high train/validation loss with AccumulateScheduler … is this normal because of the “sum” reduction?

Train Loss=1044.117188
Validation Loss=930.394165

EDIT:

The issue has to do with the fact that I’m attempting to use this in a language-model classification problem … where a probability is calculated for each actual token. I tried the code below, but it’s still reporting the large losses mentioned above.

class AbSumAccumulateScheduler(AccumulateScheduler):

    def on_batch_begin(self, last_input, last_target, **kwargs):
        "Accumulate samples and batches"
        self.acc_samples += last_input[0].shape[0]
        self.acc_batches += 1

    def on_backward_begin(self, last_target, last_loss, **kwargs:Any):
        # divide the summed loss by the number of predicted tokens
        # (targets equal to -1 are assumed to be padding/ignored positions)
        n_predicted_tokens = len(last_target[last_target != -1])
        last_loss = last_loss / n_predicted_tokens
        return {'last_loss': last_loss}
1 Like

Are the loss results in the expected range if you divide them by the batch size (2x batch size for valid, if you use the standard fastai setup)?
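For instance, with a purely made-up batch size of 64, the check would look like:

bs = 64                       # hypothetical batch size, just for illustration
print(1044.117188 / bs)       # ≈ 16.3 per-sample train loss
print(930.394165 / (2 * bs))  # ≈ 7.3 per-sample valid loss (valid bs doubled in the standard setup)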

@kcturgutlu, thanks for putting your time into this and sharing it with the community! Thanks @hwasiti and @sgugger for testing and supporting the development. I’m generally really amazed by the impact fastai has had on me, so kudos for your involvement!

I am still new to Python, PyTorch and fastai, but I have tested the scheduler callback successfully.

I just wanted to clarify:
If I end up changing loss functions (i.e. not just using cross-entropy w/ reduction=sum), is the main point that the loss of each batch is summed rather than averaged before I accumulate? As part of my adventure into fastai, I will work on implementing some different loss functions for segmentation, and I’m just worried accumulation will add more complexity to this.
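For what it’s worth, this is the kind of thing I have in mind (an untested sketch): wrap any loss that supports reduction='none' so it returns the sum over the batch.

import torch.nn as nn

class SumLoss(nn.Module):
    "Wrap a per-element loss and reduce it with a sum instead of a mean."
    def __init__(self, loss_cls, **kwargs):
        super().__init__()
        self.loss = loss_cls(reduction='none', **kwargs)

    def forward(self, preds, targets):
        return self.loss(preds, targets).sum()

# e.g. learn.loss_func = SumLoss(nn.CrossEntropyLoss)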

Will this way of accumulating gradients have a bad interaction with mixup? (I understand it may not be simple to answer.)

Thanks, and happy new year to you and this thread!

I believe there is a good solution for fastai v1:

https://www.kaggle.com/iafoss/hypercolumns-pneumothorax-fastai-0-831-lb

@sgugger
could you help clarify:

  1. Why do we need to average out the accumulated gradients? Is it necessary?

  2. In many implementations floating around, I just see people do the step with no averaging.

  3. Should we also average out the loss, and will that have any impact on the grads?

     model.zero_grad()                                   # Reset gradients tensors
     for i, (inputs, labels) in enumerate(training_set):
         predictions = model(inputs)                     # Forward pass
         loss = loss_function(predictions, labels)       # Compute loss function
         loss = loss / accumulation_steps                # Normalize our loss (if averaged)
         loss.backward()                                 # Backward pass
         if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
             optimizer.step()                            # Now we can do an optimizer step
             model.zero_grad()                           # Reset gradients tensors
             if (i+1) % evaluation_steps == 0:           # Evaluate the model when we...
                 evaluate_model()
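
For context, here is a toy check I did to convince myself about point 3 (my own snippet, please correct me if I’m wrong): with a mean-reduced loss and equally sized sub-batches, dividing each sub-batch loss by accumulation_steps makes the accumulated gradient match the gradient of one big batch.

import torch

w  = torch.tensor(2.0, requires_grad=True)
xs = torch.tensor([1., 2., 3., 4.])

# one big batch with a mean-reduced loss
((w * xs) ** 2).mean().backward()
g_big = w.grad.clone()
w.grad.zero_()

# two accumulation steps of 2 samples each, each loss divided by the number of steps
for chunk in xs.split(2):
    (((w * chunk) ** 2).mean() / 2).backward()

print(torch.allclose(g_big, w.grad))  # True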

I don’t think that’s right, is it? Momentum impacts the running stats, not the BN parameters; the BN parameters are already affected by gradient accumulation, AFAICT.