Accumulating Gradients

Created a new pull request with current changes.

It seems like converting BN to GroupNorm is enough even without accumulating gradients. You may check the notebooks I've shared.
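For readers following along, a rough sketch of such a BN-to-GroupNorm conversion might look like this (the helper name bn_to_gn and num_groups=4 are placeholders, not necessarily what the shared notebooks do):

import torch
import torch.nn as nn

def bn_to_gn(module: nn.Module, num_groups: int = 4) -> nn.Module:
    "Recursively replace every BatchNorm2d in `module` with a GroupNorm, copying the affine parameters."
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            gn = nn.GroupNorm(num_groups, child.num_features, eps=child.eps, affine=True)
            with torch.no_grad():
                gn.weight.copy_(child.weight)
                gn.bias.copy_(child.bias)
            setattr(module, name, gn)
        else:
            bn_to_gn(child, num_groups)
    return module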

That's very interesting @kcturgutlu. I tried Group Norm and Instance Norm ~2 weeks ago on the Pets notebook with resnet50 and the accuracy was not good…

I will try again using snippets from your code and report back…

Instance norm didn't work for me either. Yeah, maybe try groupnorm + accumulation and groupnorm + no accumulation.

I repeated your notebook as-is on the MNIST dataset, and yes, Group Norm worked impressively well…
However, when I changed the data to the Pets dataset, it didn't work…

Here are the results and the modified notebook:

model = vgg16_bn (chosen since it's sequential - easy to manipulate)
data = Pets dataset

Experiment Results

  1. No Accumulation
    batch_size = 64
    acc = 0.90

  2. Naive Accumulation
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.63

  3. Accumulation + BnFreeze
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.63

  4. Increase BN Momentum (on current batch stat)
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.21

  5. Decrease BN Momentum (on current batch stat)
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.65

  6. Replace BN with Instance Norm
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.03

  7. Replace BN with Group Norm
    num_groups=4
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.07

  8. ResNet18 + Replace BN with Group Norm
    num_groups=4
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.03

  9. ResNet18 + Replace BN with Group Norm no Accumulation
    num_groups=4
    bs = 2
    acc = 0.02

I suspect that the MNIST dataset is too simple. Perhaps it is already normalized, like what the Human Protein Atlas comp winner (@pudae) did when he normalized each image on its own before passing it to the model, which is of course unusual, and I don't think it can be generalized to other datasets.
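For context, per-image normalization along those lines might look like this (a rough sketch, not pudae's actual code):

import torch

def normalize_per_image(img: torch.Tensor) -> torch.Tensor:
    "Normalize a single CxHxW image by its own per-channel mean and std instead of dataset statistics."
    flat = img.reshape(img.size(0), -1)
    mean = flat.mean(dim=1)[:, None, None]
    std  = flat.std(dim=1)[:, None, None]
    return (img - mean) / (std + 1e-6)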

notebook : https://github.com/hwasiti/fastai-course-v3/blob/master/nbs/dl1/Accumulating_Batchnorm%20(PETS)-v2.ipynb


I have tried different runs with num_groups = 1, 2, 4, 8, 16, 32, 64; it seems the best is 64:

  1. Replace BN with Group Norm
    num_groups=4
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.36

  2. ResNet18 + Replace BN with Group Norm
    num_groups=4
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.12

  3. ResNet18 + Replace BN with Group Norm no Accumulation
    num_groups=4
    bs = 2
    acc = 0.44

I wonder whether groupnorm.weight, groupnorm.bias, and groupnorm.eps need to be tuned for each dataset rather than simply copied from BN?

I am testing changing them one by one.

Why am I getting different accuracy with this change:

accuracy = 0.36:

groupnorm.weight = bn.weight

accuracy = 0.20:

groupnorm.weight = torch.nn.Parameter(bn.weight)

When I divide the weight it changes from a Parameter to a plain Tensor, so I need to convert it back to a Parameter. But doing so seems to break something, even without changing the value of the weight.

I must be missing something obvious here; can you help me understand?
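For reference, one way to side-step the Parameter/Tensor issue is to modify the existing parameter in place rather than replacing the Parameter object, so anything already holding a reference to it (the optimizer, layer groups) keeps seeing the same object. A minimal sketch, with hypothetical gn/bn layers and an arbitrary scaling just for illustration:

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64)
gn = nn.GroupNorm(4, 64)

with torch.no_grad():               # keep the copies out of the autograd graph
    gn.weight.copy_(bn.weight / 2)  # in-place copy: gn.weight stays the same nn.Parameter object
    gn.bias.copy_(bn.bias)

print(type(gn.weight))              # still <class 'torch.nn.parameter.Parameter'>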

I would think that you could just do the following, waiting to step and zero_grad until after N batches. Maybe it does not work with reduce=mean? Were these two points not available when you wrote this a while ago?

from fastai.basic_train import Learner, LearnerCallback

class AccumulateStep(LearnerCallback):
    """ Does an accumulated step every nth step by accumulating gradients """
    def __init__(self, learn:Learner, n_step:int = 5):
        super().__init__(learn)
        self.n_step = n_step

    def on_backward_end(self, num_batch, **kwargs):
        if (num_batch % self.n_step) == 0:  self.opt.step()

    def on_step_end(self, num_batch, **kwargs):
        if (num_batch % self.n_step) == 0:  self.opt.zero_grad()

Stepping and zeroing gradients happen automatically whenever on_backward_end and on_step_end return False. So you would be stepping anyway with the code you've written. As for zeroing gradients, it should only happen when you do the actual accumulated step, so you need to skip the zeroing as well in order to accumulate the gradients.

You may check loss_batch to see how step and zero_grad work by default. In the implementation, we return {'skip_step':True, 'skip_zero':True} from the callback to skip them.
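For illustration, a minimal sketch of a callback built on that convention (a sketch only; it is not the library's AccumulateScheduler, and a fuller version would also rescale the accumulated gradients):

from fastai.basic_train import Learner, LearnerCallback

class AccumulateGradients(LearnerCallback):
    "Accumulate gradients and only step/zero the optimizer every `n_step` batches."
    def __init__(self, learn:Learner, n_step:int=5):
        super().__init__(learn)
        self.n_step = n_step

    def on_backward_end(self, num_batch, **kwargs):
        # skip the automatic opt.step() except on every n_step-th batch
        if (num_batch + 1) % self.n_step != 0: return {'skip_step': True}

    def on_step_end(self, num_batch, **kwargs):
        # only zero the gradients on the batches where we actually stepped
        if (num_batch + 1) % self.n_step != 0: return {'skip_zero': True}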

Hope this helps

Thanks for pointing this out. You are right, it was calling the stepper on every step with my code.
Is there a doc page for this new callback? I can't seem to find it.

I wrote this helper, which I think we should add to the library to allow people to use this callback:

from functools import partial
from fastai.basic_train import Learner
from fastai.train import AccumulateScheduler

def accum_grad(learn:Learner, n_step:int=1)->Learner:
    "Add accumulation of gradients of `n_step` during training."
    learn.callback_fns.append(partial(AccumulateScheduler, n_step=n_step))
    return learn
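Usage would then be something like this (a hypothetical example assuming a recent fastai v1 vision setup; `path` is a placeholder for your dataset folder):

from fastai.vision import *

data = ImageDataBunch.from_folder(path, bs=2)                  # small physical batch size
learn = cnn_learner(data, models.resnet18, metrics=accuracy)
learn = accum_grad(learn, n_step=32)                           # effective batch size of 64
learn.fit_one_cycle(1)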

Any feedback on that?


Yes, that might be helpful. You may find the callback here: https://github.com/fastai/fastai/blob/fbbc6f91e8e8e91ba0e3cc98ac148f6b26b9e041/fastai/train.py#L99-L134.

But there are no docs, I guess. The thing is, batchnorm is still a problem with it.


That would be because the person who introduced that feature never followed up with docs :wink:


So sorry for that, it's 100% my fault :frowning:. I was hoping to get stable results and fix the batchnorm issue first. I should probably create a doc for it explaining what this callback is for and what its limitations are.


What do you think is the simplest way to fix the batchnorm issue in a way that doesn't change too many parts of fastai?

Probably using instance norm or group norm, but in the experiments it didn't work for every dataset. For example, it worked in the case of MNIST but not for dog breeds.

For BatchNorm, I realize what was wrong with my layer now that we have had to debug a vanilla implementation with Jeremy. I don't think it's fixable without using the modified version Jeremy will introduce in the next course, but I'll think about some way to do it (basically, the update in training mode is done with the statistics of the batch, not the moving average; the moving average is only used at validation, so we need to find a way to trick BatchNorm into using the stats of the accumulated batches).


I understand your comment a bit better after lesson 10. It may be useful to have a variant that mixes both solutions (running batchnorm and accumulating batchnorm). You might already have it worked out; I will have to sleep on it.

When I watched lesson 10, I said to myself: YES… For proper statistics we should accumulate sums and sums of squares, not the moving average… How clear it is now… This is why studying the basics in part 2 of the course is so important… I didn't quite understand BN, so I didn't know what the problem was with the running BN moving average…
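In that spirit, here is a rough sketch of a BatchNorm-like layer that normalizes with accumulated sums and sums of squares instead of a moving average (only an illustration of the idea, not Jeremy's actual RunningBatchNorm class; for simplicity the statistics are kept out of the autograd graph, and a real version would also need momentum/dampening so old statistics can be forgotten):

import torch
import torch.nn as nn

class AccumulatingBatchNorm2d(nn.Module):
    "Normalize with running sums and sums of squares accumulated over batches."
    def __init__(self, nf, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(nf, 1, 1))
        self.bias   = nn.Parameter(torch.zeros(nf, 1, 1))
        self.register_buffer('sums',  torch.zeros(1, nf, 1, 1))
        self.register_buffer('sqrs',  torch.zeros(1, nf, 1, 1))
        self.register_buffer('count', torch.tensor(0.))

    def update_stats(self, x):
        with torch.no_grad():                         # statistics stay out of the autograd graph
            nc = x.shape[1]
            self.sums  += x.sum(dim=(0, 2, 3), keepdim=True)
            self.sqrs  += (x * x).sum(dim=(0, 2, 3), keepdim=True)
            self.count += x.numel() / nc              # number of values seen per channel

    def forward(self, x):
        if self.training: self.update_stats(x)
        n    = self.count.clamp(min=1.)               # avoid division by zero before any update
        mean = self.sums / n
        var  = self.sqrs / n - mean * mean
        x = (x - mean) / (var + self.eps).sqrt()
        return x * self.weight + self.bias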

I think my source of confusion is that during validation BN works with the moving average, right?
But couldn't we do validation too with Jeremy's modified method from training mode (using the stats of the accumulated batches)? And why is validation different from training?

Thanks Sylvain for your exceptional efforts, and thanks to Jeremy… You haven't let us down… Jeremy replied 1 month ago that

But here is a great developer who couldn't let it go until he solved it with you, Sylvain… :sparkling_heart:

Now I am trying to use Jeremy's RunningBatchNorm class with my Pets notebook experiments… I have created two classes, RunningBatchNorm2d and RunningBatchNorm1d, to replace all the BN types in resnet18…
I have tried for a couple of hours and modified the resnet18 to include this class instead of BN…
Here is my notebook in nbviewer

I am getting this error when I try to run fit or get learn.summary()… I will try more to debug it and let you know how things go… Just in case anybody has an idea, please let me know… I think this is something related to the different dimension arrangements between our modified class and the normal BN…

RuntimeError                              Traceback (most recent call last)
<ipython-input-58-bc39e9e85f86> in <module>
----> 1 learn.summary()

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callbacks/hooks.py in model_summary(m, n)
    164 def model_summary(m:Learner, n:int=70):
    165     "Print a summary of `m` using a output text width of `n` chars"
--> 166     info = layers_info(m)
    167     header = ["Layer (type)", "Output Shape", "Param #", "Trainable"]
    168     res = "=" * n + "\n"

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callbacks/hooks.py in layers_info(m)
    158     func = lambda m:list(map(get_layer_name, flatten_model(m)))
    159     layers_names = func(m.model) if isinstance(m, Learner) else func(m)
--> 160     layers_sizes, layers_params, layers_trainable = params_size(m)
    161     layer_info = namedtuple('Layer_Information', ['Layer', 'OutputSize', 'Params', 'Trainable'])
    162     return list(map(layer_info, layers_names, layers_sizes, layers_params, layers_trainable))

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callbacks/hooks.py in params_size(m, size)
    146     with hook_outputs(flatten_model(m)) as hook_o:
    147         with hook_params(flatten_model(m))as hook_p:
--> 148             x = m.eval()(*x) if is_listy(x) else m.eval()(x)
    149             output_size = [((o.stored.shape[1:]) if o.stored is not None else None) for o in hook_o]
    150             params = [(o.stored if o.stored is not None else (None,None)) for o in hook_p]

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     90     def forward(self, input):
     91         for module in self._modules.values():
---> 92             input = module(input)
     93         return input
     94 

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     90     def forward(self, input):
     91         for module in self._modules.values():
---> 92             input = module(input)
     93         return input
     94 

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     90     def forward(self, input):
     91         for module in self._modules.values():
---> 92             input = module(input)
     93         return input
     94 

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     90     def forward(self, input):
     91         for module in self._modules.values():
---> 92             input = module(input)
     93         return input
     94 

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     90     def forward(self, input):
     91         for module in self._modules.values():
---> 92             input = module(input)
     93         return input
     94 

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/modules/conv.py in forward(self, input)
    318     def forward(self, input):
    319         return F.conv2d(input, self.weight, self.bias, self.stride,
--> 320                         self.padding, self.dilation, self.groups)
    321 
    322 

RuntimeError: Given groups=1, weight of size [128, 64, 1, 1], expected input[1, 128, 28, 28] to have 64 channels, but got 128 channels instead

Did you try running a dummy tensor with the right shape through your adapted BN layer?

Something like this but with the BN layer:

import torch
import torch.nn as nn

conv_layer = nn.Conv2d(1, 16, 3)
x = torch.randn(10, 1, 28, 28)   # batch of 10 single-channel 28x28 images
conv_layer(x).shape              # -> torch.Size([10, 16, 26, 26])
# this also works without binding the layer to a variable: nn.Conv2d(1, 16, 3)(x)
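The same check for the norm layer might look like this (shown with the standard nn.BatchNorm2d so it runs on its own; swapping in your RunningBatchNorm2d assumes it takes the number of features the same way):

import torch
import torch.nn as nn

norm_layer = nn.BatchNorm2d(16)   # swap in RunningBatchNorm2d(16) here
x = torch.randn(10, 16, 28, 28)   # batch of 10 feature maps with 16 channels
print(norm_layer(x).shape)        # a norm layer should preserve the shape: torch.Size([10, 16, 28, 28])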

The problem with errors in models that don't have a "line-by-line" forward method is that the stack trace is quite confusing…

Another way to debug this is to insert a debugger layer (though the option above should be easier).
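A minimal sketch of such a debugger layer (the class name and the example model are hypothetical):

import torch
import torch.nn as nn

class ShapeDebugger(nn.Module):
    "Identity layer that prints the shape of whatever flows through it."
    def __init__(self, name=''):
        super().__init__()
        self.name = name
    def forward(self, x):
        print(f'{self.name}: {tuple(x.shape)}')   # or `import pdb; pdb.set_trace()` to inspect interactively
        return x

# splice it between the blocks of a Sequential model to see where the shapes go wrong
model = nn.Sequential(nn.Conv2d(1, 16, 3), ShapeDebugger('after conv'), nn.ReLU())
_ = model(torch.randn(2, 1, 28, 28))   # prints: after conv: (2, 16, 26, 26)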

Running BatchNorm is not a good idea, as I understand from the Batch Renormalization paper. I also tried it in a recent Kaggle competition and it behaved exactly the way they describe in the paper. Please let me know if anyone got it to work, though :slight_smile:

