After a month or so of experimenting on wikitext-2 and trying to train an LM as fast as possible, I wanted to share a few findings. Here is the notebook with my current results: in 150 epochs (instead of 750) I get to a perplexity of 70.73 (vs 68 for the benchmark given by Stephen Merity), then 53.1 with the cache pointer (vs 52 for that benchmark).
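In case it helps, here is roughly what the cache pointer does at evaluation time (the continuous cache from Grave et al.): keep the hidden states and targets of the last window of tokens, and mix the model’s softmax with a distribution over the words present in that cache. This is just a minimal sketch, not the code from the notebook, and the `theta`/`lam` values are illustrative:

```python
import torch
import torch.nn.functional as F

def cache_pointer_probs(logits, hidden, cache_h, cache_targets, theta=0.3, lam=0.1):
    """Mix the model softmax with a cache distribution (neural cache pointer).

    logits:        (vocab,) raw scores of the model for the next word
    hidden:        (d,) current top-layer hidden state
    cache_h:       (n, d) hidden states stored for the last n tokens
    cache_targets: (n,) LongTensor of the words that followed those states
    theta, lam:    flatness/interpolation hyperparameters (illustrative values)
    """
    p_vocab = F.softmax(logits, dim=-1)
    if cache_h.numel() == 0:
        return p_vocab
    # similarity of the current hidden state with every cached hidden state
    attn = F.softmax(theta * torch.mv(cache_h, hidden), dim=-1)   # (n,)
    # scatter that attention mass onto the cached target words
    p_cache = torch.zeros_like(p_vocab)
    p_cache.index_add_(0, cache_targets, attn)
    return (1 - lam) * p_vocab + lam * p_cache
```

The cache itself is just a sliding window: after each prediction you append the current hidden state and the true next word, and drop the oldest entry once the window is full.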
Here is a list of things that helped:
- using the dropouts from Stephen Merity and not Jeremy’s (I think this difference comes from the different tokenizer, since Jeremy’s dropouts worked best in my experiments with imdb/wiki-103); the five dropout hyperparameters involved are listed after this list.
- gradient clipping: this allows for very high learning rates and, after trying a bunch of different values, 0.12 seemed to work best (see the sketch after this list).
- 1cycle (of course) with a high learning rate: specifically, the learning rate at the minimum of the curve given by the LR Finder (and not one tenth of it, as we usually do) is the best value in this case (schedule sketched after this list).
- AR/TAR regularization helped to gain a few last points of perplexity at the end, but only for the raw model (the cache pointer gives the same results without it); see the sketch after this list.
- a longer time annealing the LR at the end (here 33% of the budget seemed best); this is included in the schedule sketched after this list.
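To be clear about which dropouts the first bullet refers to, these are the five dropout hyperparameters of the AWD-LSTM. The numbers below are only placeholders in the usual range (check the awd-lstm-lm repo for the real defaults), not necessarily the values I used:

```python
# The five dropouts of the AWD-LSTM, named as in the awd-lstm-lm repo.
# Values are placeholders for illustration, not the exact ones used here.
awd_lstm_dropouts = dict(
    dropoute=0.1,   # drops whole words from the embedding matrix
    dropouti=0.65,  # variational dropout on the embedding output
    dropouth=0.3,   # variational dropout between LSTM layers
    dropout=0.4,    # variational dropout on the last LSTM output
    wdrop=0.5,      # DropConnect on the hidden-to-hidden weights
)
```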
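For the gradient clipping, here is where it goes in a training step. I’m showing clipping of the global gradient norm with `clip_grad_norm_`; the model/optimizer/criterion are placeholders, only the 0.12 comes from my experiments:

```python
import torch.nn as nn

def train_step(model, inputs, targets, optimizer, criterion, clip=0.12):
    """One optimization step with gradient clipping (clip=0.12 worked best here)."""
    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, targets)
    loss.backward()
    # Clipping the global gradient norm is what keeps very high learning rates stable.
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip)
    optimizer.step()
    return loss.item()
```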
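Here is the shape of the LR schedule I mean in the 1cycle and annealing bullets: a linear ramp up to the maximum LR (taken at the minimum of the LR Finder curve) and back down, followed by a long final annealing taking about a third of the budget. Only `final_pct=0.33` reflects what I actually used; `div` and `final_div` are illustrative:

```python
def one_cycle_lr(step, total_steps, lr_max, div=10.0, final_pct=0.33, final_div=100.0):
    """Piecewise-linear 1cycle LR with a long final annealing phase.

    Phase 1: lr_max/div -> lr_max      (first half of the cycle)
    Phase 2: lr_max -> lr_max/div      (second half of the cycle)
    Phase 3: lr_max/div -> lr_max/(div*final_div) over the last `final_pct`
             of the budget (the long annealing at the end).
    """
    cycle_steps = int(total_steps * (1 - final_pct))
    half = max(1.0, cycle_steps / 2)
    lr_low = lr_max / div
    if step < cycle_steps:
        if step < half:                                                # going up
            return lr_low + (step / half) * (lr_max - lr_low)
        return lr_max - ((step - half) / half) * (lr_max - lr_low)     # going down
    # final annealing down to a tiny learning rate
    t = (step - cycle_steps) / max(1.0, total_steps - cycle_steps)
    return lr_low * (1 - t) + (lr_low / final_div) * t
```

As usual with 1cycle, the momentum follows the opposite shape during the cycle (high, then low at the peak LR, then high again).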
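And for AR/TAR: AR penalizes large activations on the last layer’s (dropped-out) output, TAR penalizes the difference between consecutive hidden states of the raw output. The `alpha`/`beta` weights below are the usual AWD-LSTM ones, shown for reference only:

```python
def ar_tar_penalty(raw_output, dropped_output, alpha=2.0, beta=1.0):
    """Activation (AR) and temporal activation (TAR) regularization.

    raw_output:     (seq_len, batch, d) last LSTM layer outputs before dropout
    dropped_output: (seq_len, batch, d) the same outputs after dropout
    alpha, beta:    weights of the two penalties (typical values, for reference)
    """
    ar = alpha * dropped_output.pow(2).mean()
    tar = beta * (raw_output[1:] - raw_output[:-1]).pow(2).mean()
    return ar + tar

# usage: loss = criterion(decoded, targets) + ar_tar_penalty(raw_out, dropped_out)
```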
What I tried that didn’t work:
- relaxing the regularization (dropouts, weight decay, AR/TAR) during the high-learning-rate part of the cycle. I tried reducing it the same way the momentum is reduced, from 100% down to 80%, 50% or 0%, with every combination of the three regularizers, but it didn’t yield better results (the schedule I used is sketched below).
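Concretely, by “reducing it like the momentum” I mean multiplying the regularization (dropout probabilities, weight decay, AR/TAR weights) by a factor that goes down while the LR goes up and comes back to 100% while the LR goes down, staying at 100% during the final annealing. A sketch of that factor, just to illustrate the idea since it didn’t help:

```python
def reg_scale(step, total_steps, final_pct=0.33, min_scale=0.5):
    """Scale applied to the regularization during training.

    1.0 at the start, down to `min_scale` (0.8, 0.5 or 0.0 in my trials)
    at the peak learning rate, back to 1.0 by the end of the cycle,
    then kept at 1.0 during the final annealing.
    """
    cycle_steps = int(total_steps * (1 - final_pct))
    if step >= cycle_steps:
        return 1.0
    half = max(1.0, cycle_steps / 2)
    if step < half:
        return 1.0 - (step / half) * (1.0 - min_scale)
    return min_scale + ((step - half) / half) * (1.0 - min_scale)
```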
I have a few more suggestions from Jeremy to try, but that’s already a nice first step toward training language models fast.