A little off-topic, but I wonder if I can use the idea of a custom OptWrapper to get the per-sample loss gradient w.r.t. the weights (basically remove reduction='sum') and then turn the reduction back on in on_backward_begin. I was trying to do it all with callbacks, but I don't know how to get the loss gradient in on_backward_begin and feed the change through to the rest of the gradient calculations.
I replaced all the BN layers of a Resnet50 Unet with your AccumulateBatchNorm module (63 replacements), and it exhausts the CUDA memory.
Do you know if there is any memory leak?
With a batch_size of 1, it can't even start a single epoch.
I've read the code twice, and I can't understand why.
I've tried replacing the layers with GroupNorm instead, and there is no memory leak at all.
TIA
I haven't tested it, so it's very possible there is some bug/memory leak.
Does it even finish one iteration? If not, I doubt it is a memory leak; it probably just stores too much information in memory at once for some reason (maybe the computation graph gets quite complicated with this method).
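If it does get through a few iterations, one way to tell a genuine leak from a single oversized graph is to log CUDA memory per batch. A minimal sketch using fastai v1's Callback API (the callback name and the way it is registered are my own, not from the code above); steady growth across batches suggests a leak, while an immediate blow-up points at the graph size:

    import torch
    from fastai.callback import Callback

    class CudaMemLogger(Callback):
        "Print allocated and peak CUDA memory after every batch."
        def on_batch_end(self, num_batch, **kwargs):
            alloc = torch.cuda.memory_allocated() / 1024**2
            peak = torch.cuda.max_memory_allocated() / 1024**2
            print(f"batch {num_batch}: {alloc:.0f} MB allocated, {peak:.0f} MB peak")

    # learn.fit(1, callbacks=[CudaMemLogger()])  # assuming an existing Learner `learn`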
hi @sgugger,
You said that you had removed all the BatchNorm layers from the Unet, but I still see BatchNorm in the DynamicUnet class in the fastai source code.
Getting really high train/validation loss with AccumulateScheduler
… is this normal because of the 'sum' reduction?
Train Loss=1044.117188
Validation Loss=930.394165
EDIT:
The issue has to do with the fact that I'm attempting to use this in a language model classification problem, where a probability is calculated for each actual token. I tried the code below, but it's still reporting the large losses mentioned above.
class AbSumAccumulateScheduler(AccumulateScheduler):
    def on_batch_begin(self, last_input, last_target, **kwargs):
        "accumulate samples and batches"
        self.acc_samples += last_input[0].shape[0]
        self.acc_batches += 1

    def on_backward_begin(self, last_target, last_loss, **kwargs):
        # divide the summed loss by the number of predicted (non-ignored) tokens
        n_predicted_tokens = len(last_target[last_target != -1])
        last_loss = last_loss / n_predicted_tokens
        return {'last_loss': last_loss}
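For context, a minimal sketch of how such a scheduler is typically attached in fastai v1 (the data variable, loss setup, and n_step value here are my assumptions for illustration, not taken from the post above):

    from functools import partial
    from fastai.text import text_classifier_learner, AWD_LSTM
    from fastai.layers import CrossEntropyFlat

    # assuming `data_clas` is an existing TextClasDataBunch
    learn = text_classifier_learner(data_clas, AWD_LSTM)
    learn.loss_func = CrossEntropyFlat(reduction='sum')  # sum, not mean, so accumulation stays exact
    learn.callback_fns.append(partial(AbSumAccumulateScheduler, n_step=4))  # accumulate 4 batches (illustrative)
    learn.fit_one_cycle(1)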
Are the loss results in the expected range if you divide them by the batch size (2x the batch size for validation, if you use the standard fastai setup)?
@kcturgutlu, thanks for putting your time into this and sharing it with the community! Thanks @hwasiti and @sgugger for testing and supporting the development. I'm generally really amazed by the impact fastai has had on me, so kudos for your involvement!
I am still new to Python, PyTorch, and fastai, but I have tested the scheduler callback successfully.
I just wanted to clarify:
If I end up changing loss functions (i.e. not just using cross-entropy with reduction='sum'), is the main requirement that I sum the loss over each batch rather than averaging it (see the sketch after this post)? As part of my adventure into fastai, I will work on implementing some different loss functions for segmentation, and I'm just worried accumulation will add more complexity to this.
Will this way of accumulating gradients have a bad interaction with mixup? (I understand it may not be simple to answer.)
Thanks, and happy new year to you and this thread!
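Regarding the loss-function question above: a minimal sketch of what a custom segmentation loss could look like with sum reduction (pure PyTorch, illustrative name; whether you need to mask an ignore index depends on your data):

    import torch.nn.functional as F

    def summed_seg_loss(logits, targets):
        # logits: (bs, n_classes, H, W); targets: (bs, H, W)
        # Sum over all pixels instead of averaging, so the accumulation callback
        # can later divide by the true number of accumulated samples.
        return F.cross_entropy(logits, targets, reduction='sum')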
I believe there is a good solution for fastai v1:
https://www.kaggle.com/iafoss/hypercolumns-pneumothorax-fastai-0-831-lb
@sgugger
Could you help clarify:
- Why do we need to average the accumulated weight gradients? Is it necessary?
- In many implementations floating around, I just see people call step with no averaging.
- Should we also average the loss, and will that have any impact on the gradients?
model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
        if (i+1) % evaluation_steps == 0:           # Evaluate the model when we...
            evaluate_model()
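To see why that division matters, a small self-contained check (pure PyTorch, illustrative tensors): the gradient from one mean-reduced big batch matches the gradient accumulated over equal-sized chunks whose losses are each divided by the number of chunks; skip the division and you effectively multiply the gradient by the number of accumulation steps.

    import torch

    torch.manual_seed(0)
    w = torch.randn(3, requires_grad=True)
    x, y = torch.randn(8, 3), torch.randn(8)

    # One big batch with a mean-reduced loss
    grad_big, = torch.autograd.grad(((x @ w - y) ** 2).mean(), w)

    # Two half-batches, each mean-reduced and divided by the number of accumulation steps
    grad_acc = torch.zeros_like(w)
    for xi, yi in zip(x.chunk(2), y.chunk(2)):
        grad_acc += torch.autograd.grad(((xi @ w - yi) ** 2).mean() / 2, w)[0]

    print(torch.allclose(grad_big, grad_acc))  # True for equal-sized chunks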
I don't think that's right, is it? Momentum affects the running stats, not the BN parameters, and the BN parameters are already covered by gradient accumulation, as far as I can tell.
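A quick way to see the distinction in plain PyTorch: weight and bias are parameters (updated by the optimizer, so gradient accumulation applies), while running_mean and running_var are buffers updated via momentum during forward passes.

    import torch.nn as nn

    bn = nn.BatchNorm2d(16, momentum=0.1)
    print([n for n, _ in bn.named_parameters()])  # ['weight', 'bias'] -> optimizer / grad accumulation
    print([n for n, _ in bn.named_buffers()])     # ['running_mean', 'running_var', 'num_batches_tracked'] -> momentum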