That’s the question. Weight updates are done every n_acc minibatches, and presumably are larger by a factor of n_acc. Would you therefore scale the lr by 1/n_acc to get training equivalent to running without GradientAccumulation?
P.S. And what about BatchNorm? Seems like it would misbehave.
If you’re adding the gradients rather than averaging them, you do need to divide your learning rate by (number of samples you accumulate) / (batch size), which is the number of batches you accumulate. Concretely, say you would normally fit your model with a batch size of 32 and a learning rate of 4e-4, but due to memory issues, you can only afford a batch size of 8. You would then feed the model batches of 8, accumulate the gradients of four batches, and train with a learning rate of 4e-4/(32/8) = 1e-4. I suggest you read this article to get more familiar with gradient accumulation and the various ways to perform it.
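To check the arithmetic, here is a minimal sketch in plain Python (no framework) using a made-up one-parameter linear model. The data, the `mean_grad` helper, and the starting weight are all hypothetical; the only assumption that matters is that each mini-batch loss is averaged over its samples, which is the usual default. It compares one SGD step on a batch of 32 against four accumulated batches of 8:

```python
import random

def mean_grad(w, xs, ys):
    """Gradient w.r.t. w of 0.5 * (w*x - y)**2, averaged over the batch."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data: y is roughly 3 * x.
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(32)]
ys = [3.0 * x + random.gauss(0, 0.1) for x in xs]

w0, lr, full_bs, micro_bs = 0.5, 4e-4, 32, 8
n_acc = full_bs // micro_bs  # 4 accumulated batches

# Reference: a single SGD step on the full batch of 32.
w_full = w0 - lr * mean_grad(w0, xs, ys)

# Accumulation: add up the mean gradient of each batch of 8.
g_sum = 0.0
for i in range(0, full_bs, micro_bs):
    g_sum += mean_grad(w0, xs[i:i + micro_bs], ys[i:i + micro_bs])

w_scaled_lr = w0 - (lr / n_acc) * g_sum  # summed gradients, lr divided by n_acc
w_avg_grad = w0 - lr * (g_sum / n_acc)   # averaged gradients, lr unchanged
w_unscaled = w0 - lr * g_sum             # summed gradients, lr unchanged

print(w_full, w_scaled_lr, w_avg_grad, w_unscaled)
```

Under these assumptions the first three values agree up to floating-point rounding, while the last one takes a step `n_acc` times larger. Dividing each loss by `n_acc` before the backward pass should have the same effect as dividing the learning rate for plain SGD.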
Batch norm doesn’t work well with gradient accumulation because its statistics are computed separately on each small batch and aren’t accumulated across them, so the layer still behaves as if the batch size were the small one. You can either use another type of normalization (like group norm), add tweaks like weight standardization, or not accumulate gradients at all and go with a small batch size (smaller batch sizes often result in better accuracy at the cost of more training time).
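To illustrate the batch norm point, here is a small sketch in plain Python with made-up activations (the values and the `batch_stats` helper are hypothetical, not taken from any library). It computes the mean and variance that a batch norm layer would normalize with on each batch of 8, and compares them with the statistics of the full batch of 32:

```python
import random
import statistics

# Hypothetical activations for one channel, 32 samples.
random.seed(0)
acts = [random.gauss(2.0, 3.0) for _ in range(32)]

def batch_stats(values):
    """Mean and biased (population) variance of one batch."""
    return statistics.fmean(values), statistics.pvariance(values)

full_mean, full_var = batch_stats(acts)

# Statistics seen by the layer when the data arrives as four batches of 8.
micro_stats = [batch_stats(acts[i:i + 8]) for i in range(0, 32, 8)]

for m, v in micro_stats:
    print(f"batch of 8:  mean={m:+.3f}  var={v:.3f}")
print(f"batch of 32: mean={full_mean:+.3f}  var={full_var:.3f}")
```

Each batch of 8 is normalized with its own, noisier statistics, and accumulating gradients afterwards does nothing to change that.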
Have a nice weekend!