Hello @boris,
I tried to use `GradientAccumulation` (with fastai v2), announced by @muellerzr, in an NLP learner (i.e., a learner with a language model, as done in the notebook of @morgan) that I fit with `fit_one_cycle()`. The loss is the standard cross-entropy loss for a language model.
```python
a = dsl.bs*k
learn.fit_one_cycle(epochs, lr_max, cbs=GradientAccumulation(n_acc=a))
```
As you can see, my `n_acc` is a multiple (`k`) of the batch size (`bs`).
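For reference, here is a fuller sketch of my setup; the data, architecture, and hyperparameter values below are placeholders for illustration, not my exact ones (my real learner follows @morgan's notebook):

```python
from fastai.text.all import *

# Placeholder data and language-model learner (IMDB_SAMPLE is just for illustration)
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
dsl = TextDataLoaders.from_df(df, path=path, text_col='text', is_lm=True, bs=64)
learn = language_model_learner(dsl, AWD_LSTM)

epochs, lr_max = 1, 1e-3   # illustrative values
k = 2                      # number of batches to accumulate over
a = dsl.bs * k             # n_acc counts samples, so bs * k means one optimizer step every k batches
learn.fit_one_cycle(epochs, lr_max, cbs=GradientAccumulation(n_acc=a))
```

With this setup, here is what I observe: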
- When `k = 1` (no gradient accumulation), my learner works as if it did not have the `cbs`, with a training loss of about 2.
- When `k = 2` (accumulating gradients over 2 batches) or more, the training loss explodes (to about 50,000) and decreases very slowly, i.e. it stays high even after many weight updates of the model.
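To make the comparison concrete, this is roughly the experiment (same placeholder names and illustrative hyperparameters as in the sketch above):

```python
# Compare training with and without gradient accumulation on the same data
for k in (1, 2):
    learn = language_model_learner(dsl, AWD_LSTM)   # fresh learner for each run
    learn.fit_one_cycle(epochs, lr_max,
                        cbs=GradientAccumulation(n_acc=dsl.bs * k))
    # k = 1: training loss ends up around 2, as without the callback
    # k = 2: training loss is around 50,000 and decreases very slowly
```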
I saw that @wgpubs had this problem last year (with fastai v1, I guess), but I don’t know whether he resolved it.
Do you think the `GradientAccumulation` callback does not work well with NLP models?
Thanks in advance to anyone with a suggestion.