Try it yourself: in Jeremy's notebook 10_nlp.ipynb, pass cbs=GradientAccumulation() to learn.fit_one_cycle(). You should observe a huge running training loss (see my first screenshot)…
In fact, if you run the training all the way to the end (same example: notebook 10_nlp.ipynb, see my second screenshot), you will see that the validation loss and accuracy are correct (compared to what Jeremy got), but the training loss is far too high.
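For anyone who wants to reproduce it, this is essentially all I changed (a minimal sketch; it assumes the `dls_lm` DataLoaders built earlier in notebook 10_nlp.ipynb):

```python
from fastai.text.all import *

# Same learner as in notebook 10_nlp.ipynb (assumes `dls_lm` is already built)
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()

# The only change: add GradientAccumulation to the callbacks
learn.fit_one_cycle(1, 2e-2, cbs=GradientAccumulation())
```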
What I think:
- GradientAccumulation() itself works well (the validation results come out fine);
- but the running training loss, up to and including the final one, shows the raw (summed) value, not the per-batch average.
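If I read the callback correctly, nothing in it ever divides the loss by the number of accumulated batches; it only skips the optimizer step. This is roughly the version shown in fastbook (a simplified sketch; the exact event names differ between fastai releases):

```python
from fastai.text.all import *

class GradientAccumulationSketch(Callback):
    "Simplified sketch of the fastbook-era callback: skip the weight update
    until n_acc items have been seen. Note: the loss is never rescaled."
    def __init__(self, n_acc=32): store_attr()
    def before_fit(self): self.count = 0
    def after_backward(self):
        self.count += find_bs(self.learn.yb)
        if self.count < self.n_acc:
            raise CancelBatchException()  # skip opt.step() and zero_grad()
        else: self.count = 0
```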
How can this last point be corrected?
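Here is what I would expect instead, sketched in plain PyTorch since the fastai callback event names have changed between versions (the function name, `n_acc`, and `beta` below are my own, not library API): divide the loss used for backward by n_acc, so the accumulated gradients match the *average* over the micro-batches, while still logging the raw per-batch loss.

```python
import torch

def fit_with_accumulation(model, loss_func, opt, dl, n_acc=4, beta=0.98):
    "Plain-PyTorch sketch: accumulate gradients but log the per-batch loss."
    model.train()
    smooth = None
    opt.zero_grad()
    for i, (xb, yb) in enumerate(dl):
        loss = loss_func(model(xb), yb)
        # Record the raw per-batch loss (what the progress bar should show),
        # exponentially smoothed like fastai's Recorder does...
        smooth = loss.item() if smooth is None else beta*smooth + (1-beta)*loss.item()
        # ...but backprop the loss divided by n_acc, so the summed gradients
        # equal the gradient of the average loss over n_acc mini-batches.
        (loss / n_acc).backward()
        if (i + 1) % n_acc == 0:
            opt.step()
            opt.zero_grad()
    return smooth
```

Would a fix along these lines (rescaling the loss before backward, while keeping the unscaled value for the Recorder) be the right way to go?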