Hello @boris,
I tried to use `GradientAccumulation` (with fastai v2), announced by @muellerzr, in an NLP learner (i.e., a learner with a language model, as done in the notebook of @morgan) that I fit with `fit_one_cycle()`. The loss is the standard cross-entropy loss for a language model.
```python
a = dsl.bs*k
learn.fit_one_cycle(epochs, lr_max, cbs=GradientAccumulation(n_acc=a))
```
As you can see, my `n_acc` is a multiple (`k`) of the batch size (`bs`).
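For reference, here is a fuller sketch of my setup; the data, architecture, and hyperparameter values below are placeholders for illustration, not my exact ones (my real learner follows @morgan's notebook):

```python
from fastai.text.all import *

# Placeholder data and language-model learner (IMDB_SAMPLE is just for illustration)
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
dsl = TextDataLoaders.from_df(df, path=path, text_col='text', is_lm=True, bs=64)
learn = language_model_learner(dsl, AWD_LSTM)

epochs, lr_max = 1, 1e-3   # illustrative values
k = 2                      # number of batches to accumulate over
a = dsl.bs * k             # n_acc counts samples, so bs * k means one optimizer step every k batches
learn.fit_one_cycle(epochs, lr_max, cbs=GradientAccumulation(n_acc=a))
```

With this setup, here is what I observe: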
- When `k = 1` (no gradient accumulation), my learner works as if it did not have the `cbs`, with a training loss of about 2.
- When `k = 2` (accumulating gradients over 2 batches) or more, the training loss explodes (to about 50,000) and decreases very slowly, i.e. it stays high even after many weight updates of the model.
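To make the comparison concrete, this is roughly the experiment (same placeholder names and illustrative hyperparameters as in the sketch above):

```python
# Compare training with and without gradient accumulation on the same data
for k in (1, 2):
    learn = language_model_learner(dsl, AWD_LSTM)   # fresh learner for each run
    learn.fit_one_cycle(epochs, lr_max,
                        cbs=GradientAccumulation(n_acc=dsl.bs * k))
    # k = 1: training loss ends up around 2, as without the callback
    # k = 2: training loss is around 50,000 and decreases very slowly
```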
I saw that @wgpubs had this problem last year (with fastai v1, I guess), but I don’t know whether he resolved it.
Do you think the `GradientAccumulation` callback does not work well with NLP models?
Thanks in advance to anyone with a suggestion.