Accumulating Gradients

@kcturgutlu
I am fiddling with your notebook…

The function find_active_bn is missing…
I suppose it is in data_utils, right?

Can you upload it to GitHub?

So here is how I wrapped all BN layers.
Is there a more efficient way to find all BN layers in a model other than multiple nested for loops?

def change_all_BN(module):
    # wrap any attribute named bn0..bn4 (covers the bn1/bn2 layers inside a BasicBlock)
    for i in range(5):
        attr = 'bn' + str(i)
        if hasattr(module, attr):
            setattr(module, attr, AccumulateBatchNorm(getattr(module, attr)))


def wrap_BN(model):
    for i in range(len(model)):
        for j in range(len(model[i])):
            if isinstance(model[i][j], bn_types):
                model[i][j] = AccumulateBatchNorm(model[i][j])
            elif model[i][j].__class__.__name__ == "Sequential":
                for k in range(len(model[i][j])):
                    if isinstance(model[i][j][k], bn_types):
                        model[i][j][k] = AccumulateBatchNorm(model[i][j][k])
                    elif model[i][j][k].__class__.__name__ == "BasicBlock":
                        change_all_BN(model[i][j][k])
                        if getattr(model[i][j][k], 'downsample', None) is not None:
                            for l in range(len(model[i][j][k].downsample)):
                                if isinstance(model[i][j][k].downsample[l], bn_types):
                                    model[i][j][k].downsample[l] = AccumulateBatchNorm(model[i][j][k].downsample[l])
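
One way to avoid the nested loops would be to walk the module tree recursively with named_children(); a rough sketch, reusing bn_types and AccumulateBatchNorm from above:

def wrap_BN_recursive(module):
    # replace any BN layer found anywhere in the tree, recurse into everything else
    for name, child in module.named_children():
        if isinstance(child, bn_types):
            setattr(module, name, AccumulateBatchNorm(child))
        else:
            wrap_BN_recursive(child)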

So with the BN layers wrapped, GPU memory keeps increasing until I hit an out-of-memory (OOM) error. I suspect the tensors held by the wrapped BN layers are kept and accumulated on the GPU across all forward passes.
I will look further into it later.

Here is my code:

pat = re.compile(r'/([^/]+)_\d+.jpg$')
data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(), size=224, bs=BS
                                  ).normalize(imagenet_stats)
def get_learner():
    turn_on_accumulation()
    learn = create_cnn(data=data, arch=models.resnet34, metrics=error_rate,
                       callback_fns=[partial(AccumulateStep, n_step=N_STEP)])
    learn.loss_func = CrossEntropyFlat(reduction="sum")
    return learn

learn = get_learner() 
wrap_BN(learn.model)
learn.fit_one_cycle(1)

You can use the new callback in master, which doesn't require turn_on_.... I also tried several things to make it work, but it's either a backward error or a pickle error. I am updating the repo; the link is already shared here, so I'm not sharing it again.

I am also trying to understand the torch.batch_norm code, which only takes running_mean and running_var, not the batch mean and var. I guess that means the batch mean and var are computed inside that function when training=True; otherwise the running stats are used. But this is probably not what we want. We want to normalize each sample with the batch stats, but since we don't know the upcoming samples during accumulation and can't temporarily hold them because of memory, there are a couple of possible solutions: either do two passes per epoch to compute the batch stats, as was done in the Kaggle competition shared here, or accumulate the batch stats (batch mean and var) and use them for the current micro-batch (this is an approximation that gets better as we approach the end of the effective batch). I tried implementing the latter in my repo, but there are some issues to be fixed. Appreciate everyone's help :slight_smile:
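
A minimal sketch of the second idea, accumulating per-channel statistics across micro-batches and normalizing each micro-batch with the stats gathered so far (the class and method names are made up here; it skips the running-stats update and does not backprop through the statistics, unlike real BatchNorm):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ApproxAccumulateBN(nn.Module):
    "Normalize each micro-batch with per-channel stats accumulated since the last optimizer step."
    def __init__(self, bn):
        super().__init__()
        self.bn = bn
        self.register_buffer('acc_sum', torch.zeros_like(bn.running_mean))
        self.register_buffer('acc_sqsum', torch.zeros_like(bn.running_var))
        self.count = 0

    def reset(self):
        # call this right after the accumulated optimizer step
        self.acc_sum.zero_(); self.acc_sqsum.zero_(); self.count = 0

    def forward(self, x):
        if not self.training:
            return self.bn(x)
        with torch.no_grad():
            # accumulate per-channel first and second moments over micro-batches
            self.acc_sum += x.sum(dim=(0, 2, 3))
            self.acc_sqsum += (x * x).sum(dim=(0, 2, 3))
            self.count += x.numel() // x.size(1)
            mean = self.acc_sum / self.count
            var = self.acc_sqsum / self.count - mean * mean
        # normalize with the accumulated (approximate) batch stats
        return F.batch_norm(x, mean, var, self.bn.weight, self.bn.bias,
                            training=False, momentum=0., eps=self.bn.eps)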

So I conducted some experiments, as follows:

model = vgg16_bn (chosen since it’s sequential - easy to manipulate)
data = MNIST_SAMPLE

Experiment Results

  1. No Accumulation
    batch_size = 64
    acc = 0.94

  2. Naive Accumulation
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.49

  3. Accumulation + BnFreeze
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.59

  4. Increase BN Momentum (on current batch stat)
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.49

  5. Decrease BN Momentum (on current batch stat)
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.57

  6. Replace BN with Instance Norm
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.57

  7. Replace BN with Group Norm
    num_groups=4
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.98

  8. ResNet18 + Replace BN with Group Norm
    num_groups=4
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.99

  9. ResNet18 + Replace BN with Group Norm no Accumulation
    num_groups=4
    bs = 2
    acc = 0.99

GroupNorm seems to work pretty well, but we can't be sure without trying it on ResNet variants and on different datasets.

notebook : https://github.com/KeremTurgutlu/experimental/blob/master/Accumulating_Batchnorm.ipynb

Here is the group norm paper: https://arxiv.org/abs/1803.08494

@hwasiti, maybe you can run the same experiments, this time converting all BN layers to GroupNorm like I did in my notebook.
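
The conversion can be done recursively; a sketch of one way to do it (not the exact code from the notebook, and it assumes every BN channel count is divisible by num_groups):

import torch
import torch.nn as nn

def bn2gn(module, num_groups=4):
    "Recursively replace every BatchNorm2d in `module` with GroupNorm (a rough sketch)."
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            gn = nn.GroupNorm(num_groups, child.num_features, eps=child.eps)
            with torch.no_grad():
                # optionally carry over the affine parameters learned by BN
                gn.weight.copy_(child.weight)
                gn.bias.copy_(child.bias)
            setattr(module, name, gn)
        else:
            bn2gn(child, num_groups)
    return module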

Please note that v1.0.47 will have a breaking change that affects this callback (see announcement in the developer chat). To skip the step and the grad zeroing, just return:

return {'skip_step': True, 'skip_zero': True}

wherever it is more convenient (probably in on_backward_end).

Created a new pull request with current changes.

It seems like converting BN to GroupNorm is enough, even without accumulating gradients. You may check the notebooks I've shared.

That's very interesting, @kcturgutlu. I tried Group Norm and Instance Norm on the Pets notebook with resnet50 about two weeks ago, and the accuracy was not good…

I will try again using snippets from your code and report back…

Instance norm didn't work for me either. Yeah, maybe run GroupNorm + accumulation and GroupNorm + no accumulation.

I repeated your notebook as-is on the MNIST dataset, and yes, Group Norm worked impressively well…
However, when I changed the data to the Pets dataset, it didn't work…

Here are the results and the modified notebook:

model = vgg16_bn (chosen since it’s sequential - easy to manipulate)
data = Pets dataset

Experiment Results

  1. No Accumulation
    batch_size = 64
    acc = 0.90

  2. Naive Accumulation
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.63

  3. Accumulation + BnFreeze
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.63

  4. Increase BN Momentum (on current batch stat)
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.21

  5. Decrease BN Momentum (on current batch stat)
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.65

  6. Replace BN with Instance Norm
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.03

  7. Replace BN with Group Norm
    num_groups=4
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.07

  8. ResNet18 + Replace BN with Group Norm
    num_groups=4
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.03

  9. ResNet18 + Replace BN with Group Norm no Accumulation
    num_groups=4
    bs = 2
    acc = 0.02

I suspect that the MNIST dataset is too simple. Perhaps it is already normalized, similar to what the Human Protein Atlas competition winner (@pudae) did when he normalized each image on its own before passing it to the model, which is of course odd, and I don't think it generalizes to other datasets.

notebook : https://github.com/hwasiti/fastai-course-v3/blob/master/nbs/dl1/Accumulating_Batchnorm%20(PETS)-v2.ipynb

I have tried different runs with num_groups = 1, 2, 4, 8, 16, 32, 64;
the best seems to be 64:

  1. Replace BN with Group Norm
    num_groups=4
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.36

  2. ResNet18 + Replace BN with Group Norm
    num_groups=4
    effective_batch_size = 64
    step = 32
    bs = 2
    acc = 0.12

  3. ResNet18 + Replace BN with Group Norm no Accumulation
    num_groups=4
    bs = 2
    acc = 0.44

I wonder whether groupnorm.weight, groupnorm.bias, and groupnorm.eps need to be tuned for each dataset rather than simply copied from BN?

I am testing changing them one by one.

Why am I getting a different accuracy with this change?

accuracy = 0.36:

groupnorm.weight = bn.weight

accuracy = 0.20:

groupnorm.weight = torch.nn.Parameter(bn.weight)

When I divide the weight, it changes from a Parameter to a plain tensor, so I need to convert it back to a Parameter. But doing so seems to break something, even without changing the value of the weight.

I must be missing something obvious here; can you help me understand?
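
One thing that may matter (just a guess): if the Learner/optimizer has already been created, assigning a freshly constructed nn.Parameter swaps in a new tensor object that the optimizer doesn't know about. Copying the values in place keeps the original Parameter registered, for example:

import torch

with torch.no_grad():
    # values change, but groupnorm.weight stays the same Parameter object
    groupnorm.weight.copy_(bn.weight)   # or copy_(bn.weight / some_factor), with some_factor being whatever you divide by
    groupnorm.bias.copy_(bn.bias)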

I would think that you could just do the following, waiting to step and zero_grad until after N batches. Maybe it does not work with reduction="mean"? Were these two points not available when you wrote this a while ago?

class AccumulateStep(LearnerCallback):
    """ Does accumlated step every nth step by accumulating gradients """
    def __init__(self, learn:Learner, n_step:int = 5):
        super().__init__(learn)
        self.n_step = n_step
        
    def on_backward_end(self, num_batch,**kwargs):
        if (num_batch % self.n_step) == 0:  self.opt.step()
            
    def on_step_end(self, num_batch, **kwargs):
        if (num_batch % self.n_step) == 0:  self.opt.zero_grad()

Stepping and zeroing the grad happen automatically whenever on_backward_end and on_step_end return False, so with the code you've written you would still be stepping on every batch. As for zeroing the gradients, it's only necessary when you do the actual accumulated step, so you should also skip the zeroing on the other batches in order to accumulate the gradients.

You may check loss_batch to see how step and zero_grad work by default.

In the implementation, we use return {'skip_step': True, 'skip_zero': True} to skip both.

Hope this helps.
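
For reference, a simplified sketch of such a callback using the new return convention (not the exact code in master, which handles a few more details):

from fastai.basic_train import Learner, LearnerCallback

class SimpleAccumulateStep(LearnerCallback):
    "Accumulate gradients and only step/zero the optimizer every `n_step` batches."
    def __init__(self, learn:Learner, n_step:int=4):
        super().__init__(learn)
        self.n_step = n_step

    def on_backward_end(self, num_batch, **kwargs):
        # on intermediate batches skip both opt.step() and opt.zero_grad(),
        # so gradients keep accumulating until the next real step
        if (num_batch % self.n_step) != 0:
            return {'skip_step': True, 'skip_zero': True}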

Thanks for pointing this out. You are right, it was calling the stepper on every step with my code.
Is there a docs page for this new callback? I can't seem to find it.

I wrote this helper, which I think we should add to the library to allow people to use this callback:

def accum_grad(learn:Learner, n_step:int=1)->Learner:
    "Add accumulation of gradients of `n_step` during training."
    learn.callback_fns.append(partial(AccumulateScheduler, n_step=n_step))
    return learn
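
Usage might then look something like this (the n_step value here is arbitrary):

learn = create_cnn(data, models.resnet34, metrics=error_rate)
learn = accum_grad(learn, n_step=16)   # gradients accumulated over 16 batches
learn.fit_one_cycle(1)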

Any feedback on that?

Yes, that might be helpful. You can find the callback here: https://github.com/fastai/fastai/blob/fbbc6f91e8e8e91ba0e3cc98ac148f6b26b9e041/fastai/train.py#L99-L134.

But there are no docs, I guess. The thing is, batchnorm is still a problem with it.

That would be because the person who introduced that feature never followed up with docs :wink:

So sorry for that, it's 100% my fault :frowning: I was hoping to get stable results and fix the batchnorm issue first. I should probably write a docs page explaining what this callback is for and what its limitations are.

What do you think is the simplest way to fix the batchnorm issue in a way that doesn’t change too many parts of fastai?