Super convergence(ish) on wikitext-2

sgugger · May 27, 2018, 1:03pm

After a month or so of experimenting on wikitext-2 and training a LM as fast as possible, I wanted to share a few findings. Here is the notebook with my current results: in 150 epochs (instead of 750) I get to a perplexity of 70.73 (vs 68 for the benchmark given by Stephen Merity), then 53.1 with the cache pointer (vs 52 for the benchmark given by Stephen Merity).

Here is a list of things that helped:

using the dropouts from Stephen Merity and not Jeremy’s (I think this part comes from the different Tokenizer since Jeremy’s dropouts worked best in my experiments with imdb/wiki-103).
gradient clipping: this allows for very high learning rates and after trying a bunch of different values, 0.12 seemed to work best.
1cycle (of course) with high learning rate: specifically, the minimum of the curve given by the LR Finder (and not one tenth of it as we usually do) is the best value in this case.
AR/TAR regularization helped gain a few last points at the end, but only for the raw model (the cache pointer gives the same results without it)
a longer time annealing the LR at the end (here 33% of the budget seemed best)
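
To make the schedule described in the list above concrete, here is a minimal sketch of a 1cycle-style learning rate curve with a final annealing phase covering 33% of the budget. It is not the code from the notebook: the function name `one_cycle_lr`, the linear shape of each phase, and the `div` and `final_div` ratios are assumptions chosen for illustration.

```python
def one_cycle_lr(step, total_steps, lr_max, div=10.0, anneal_frac=0.33, final_div=100.0):
    """Learning rate at `step` (0-based) for a simple 1cycle-style schedule.

    All phases are linear (an assumption of this sketch):
      1. warm up from lr_max / div to lr_max
      2. cool down from lr_max back to lr_max / div
      3. anneal from lr_max / div to lr_max / (div * final_div)
         over the last `anneal_frac` of the steps
    """
    lr_min = lr_max / div
    anneal_steps = int(total_steps * anneal_frac)
    cycle_steps = total_steps - anneal_steps
    half = cycle_steps / 2

    if step < half:
        # Phase 1: linear warm-up
        return lr_min + (lr_max - lr_min) * step / half
    if step < cycle_steps:
        # Phase 2: linear cool-down
        return lr_max - (lr_max - lr_min) * (step - half) / half
    # Phase 3: final annealing well below the starting learning rate
    t = (step - cycle_steps) / max(1, anneal_steps)
    return lr_min + (lr_min / final_div - lr_min) * t


# Example: one value per step, with lr_max taken from the minimum of the LR Finder curve
schedule = [one_cycle_lr(s, total_steps=300, lr_max=1.0) for s in range(300)]
```

In a real training loop the value would be written into the optimizer at every iteration, with gradient clipping applied before each optimizer step.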

What I tried and didn’t work:

relaxing the regularization (dropouts, weight decay, AR/TAR) during the part with high learning rates. I tried reducing it like the momentum, from 100% to 80%, 50% or 0%, every combination of the three regs, but it didn’t yield better results

I have a few more suggestions from Jeremy to try, but that’s already a nice first step toward training language models fast.

urmas.pitsi · May 31, 2018, 12:00pm

I wonder what went wrong: I see results that are too good on my screen, acc=98.6% and perplexity=1.02. Is there any good way to test whether a language model is really as good as the numbers say? I am trying to predict on arbitrary sentences, but I am not sure how to do it properly.

sgugger · May 31, 2018, 12:42pm

Those numbers look way too good to be true, there must be a bug somewhere.
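
One quick way to read such numbers: perplexity is the exponential of the mean cross-entropy loss (in nats), so its inverse is the geometric mean of the probability the model gives to the correct next token. The helper names below are hypothetical and only illustrate that relationship.

```python
import math


def perplexity_from_loss(mean_nll):
    """Perplexity for a mean negative log-likelihood measured in nats."""
    return math.exp(mean_nll)


def mean_token_probability(perplexity):
    """Geometric mean of the probability assigned to the correct next token."""
    return 1.0 / perplexity


# A perplexity of 1.02 means the correct token gets about 98% probability on average.
suspicious = mean_token_probability(1.02)
# A perplexity of 70.73 corresponds to about 1.4% on average.
plausible = mean_token_probability(70.73)
```

A model that is right with about 98% confidence on every next word is not plausible for open text, which is why a bug is the likely explanation. One usual suspect, offered here only as an assumption since the thread does not identify the cause, is the target tokens leaking into the inputs.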

sooheon · June 2, 2018, 2:17pm

Thanks for the code, I’m trying out the Adam variant as well. I’m working on Korean, so I’ve made my own corpus, and for a 500k sentence subset of it, one epoch takes ~20mins. I see wiki.train.tokens is ~36k sentences long, about how long does an epoch take for you? 90 epochs would be 30 hrs for me.

sgugger · June 2, 2018, 6:05pm

It also depends on your vocab size since the last softmax is very heavy computation-wise. Wikitext-2 has a vocab size around 30k words and one epoch takes me roughly 1 minute or 2 (depending on the GPU).

sooheon · June 2, 2018, 6:54pm

Cool, thanks.

I think I’ll make two datasets parallel to wikitext2 and 103, a smaller one for iterating and finding good hyperparams on.

How would you go about adjusting hyperparams from 2->103? I know in the Leslie N. Smith paper it’s emphasized that regularization must be balanced for the data and the architecture… increasing data size by 1000x should change your hyperparams, correct?

sgugger · June 2, 2018, 10:11pm

Stephen Merity has different values for dropout for instance, between wt 2 and 103. Since there is so much more data, the model will overfit less and less regularization is needed. Best values of dropout will be different for instance, probably weight decay too.
That’s why I’m trying to make superconvergence work: that way, training doesn’t take that much time (even on a big dataset) and you can experiment. Hopefully, hyper-parameters will then stay the same for other big datasets in different languages, so that it’s easy to train different language models.

sooheon · June 3, 2018, 6:59am

@sgugger Are you aware of something like lr_find for momentum and weight decay? I’m digesting Leslie Smith’s paper, but the bare bones seems to come down to:

lr search: set a max lr
batch size search: as much as gpu can take is good default
momentum search: find good momentum, decrease it whenever you increase lr to counterbalance
wd search: adjust to architecture depth and dataset complexity

sgugger · June 3, 2018, 1:14pm

Sadly not. Finding a quick way to set all the hyper-parameters would be so nice, but I haven’t found anything useful yet.

For momentum, the usefulness of the range (usually (0.95,0.85)) is that you don’t have to bother finding a good value. As Leslie said in his article, the best momentum value would perform as well, but by using cyclical momentum you don’t have to bother fine-tuning. You can always test (0.9,0.8) or (0.99,0.85) to see if it improves the final results, but (0.95,0.85) has worked well for me (except for Adam on RNNs where (0.8,0.7) seems better).
For weight decay, I’m not sure I have properly understood the way Leslie uses several LR Finder runs to pick it. If you have found a rule that works, don’t hesitate to share.
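
As an illustration of the cyclical momentum range discussed above, here is a minimal sketch in which momentum moves opposite to the learning rate: it drops from 0.95 to 0.85 while the learning rate rises, then climbs back. The function name `cyclical_momentum` and the linear, symmetric shape are assumptions of this sketch, not taken from the notebook or the paper.

```python
def cyclical_momentum(step, total_steps, mom_high=0.95, mom_low=0.85):
    """Momentum at `step`, mirroring a triangular learning rate cycle.

    Momentum is highest when the learning rate is lowest (start and end of
    the cycle) and lowest when the learning rate peaks (middle of the cycle).
    """
    half = total_steps / 2
    if step < half:
        frac = step / half                  # 0 -> 1 while the LR increases
    else:
        frac = (total_steps - step) / half  # 1 -> 0 while the LR decreases
    return mom_high - (mom_high - mom_low) * frac


# Example: momentum values over a 100-step cycle
momenta = [cyclical_momentum(s, 100) for s in range(100)]
```

Trying another range such as (0.9,0.8) only requires changing `mom_high` and `mom_low`.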

Leslie · June 4, 2018, 9:26am

If weight decay is “large” then the max learning rate must be small. Since large learning rates regularize, one needs less regularization from weight decay. I suggested running the LR finder at a few weight decay values to determine when you can use larger learning rates.

Does this help?

sgugger · June 4, 2018, 1:05pm

It does, and thanks for your reply.
In practice though, I have often found that it’s difficult to see if the changes in the LR Finder graph are due to random noise or the importance of weight decay. Maybe I should average a few of them to get cleaner plots.
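
The averaging idea can be sketched as follows: run the LR Finder several times for the same weight decay, average the loss curves point by point, and only then read off the learning rate at the minimum. The loss values below are made up for illustration, and both helper names are hypothetical.

```python
def average_curves(curves):
    """Pointwise mean of several loss curves recorded at the same learning rates."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]


def lr_at_min_loss(lrs, losses):
    """Learning rate at which the (averaged) loss curve reaches its minimum."""
    best = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best]


# Hypothetical LR Finder output: three noisy runs at the same weight decay
lrs = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
runs = [
    [5.0, 4.0, 3.0, 3.5, 9.0],
    [5.2, 4.1, 3.4, 3.1, 8.0],  # noise shifts this run's minimum to lr=1.0
    [4.9, 3.9, 2.9, 3.3, 9.5],
]

smoothed = average_curves(runs)
best_lr = lr_at_min_loss(lrs, smoothed)
```

Repeating this for a few weight decay values gives one averaged curve per value, which should make it easier to tell a real shift of the curve from run-to-run noise.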

adilism · June 14, 2018, 4:35pm

Thank you, the notebook and the discussion have been extremely helpful. How did you decide on the cycle length, i.e. the number of epochs? Given that the whole training process is one long cycle this seems to become another parameter to tune.

sgugger · June 14, 2018, 5:01pm

I didn’t have any trick to pick up the length, it’s mostly through several experiments and picking a point where increasing it doesn’t really give a better result.
I’ll put an updated version soon, but I now have better results with plain Adam in 90 epochs and a customized 1cycle policy.

vova · June 22, 2018, 7:27pm

@sgugger
You might find this interesting - How Can Neural Network Similarity Help Us Understand Training and Generalization

We also found that trained networks with identical structures but different learning rates converge to distinct clusters with similar performance, but highly dissimilar representations.

If learning rate affects representations that model learns, it would be very interesting to see how dynamic learning rate changes the picture.

aayushy · July 16, 2018, 8:39am

This is a great resource, thanks!

A question: from what I understand you’re using this for pre-training your language model on the Wikipedia corpus – for which the authors originally used a pretty straightforward training schedule. Going by your notebook, the accuracy is ~30%, which is not very high.

Not having trained LMs before, I have no objective way of knowing whether that is “good”/“bad”. Would you say that this is usually what the LM is able to achieve?

Or is accuracy not a good metric in this case? Should perplexity be considered a better metric in this situation?

sgugger · July 16, 2018, 12:00pm

Accuracy isn’t the metric usually used in publications for LM, it’s perplexity that’s used in benchmarks. On this dataset, we now get real super convergence by using Adam/AdamW (reaching the same perplexity as the authors who invented this model, 68.7) with this script (refer to the README in the repo for the choice of hyper-parameters).

Getting to 30% is really good already, in the sense that it means the model got the next word (among 30K possibilities here!) exactly right 30% of the time. The finetuned model that is used to get 94.7-94.8% accuracy on imdb has an accuracy (as a LM) of 31.1% for instance.

aayushy · July 16, 2018, 12:10pm

Ah, I see. So accuracy and loss are still a good approximation of the language model performance during training, but for more concrete results perplexity is what matters.
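
To see how the two metrics differ, here is a small sketch that computes both from the same predictions. Accuracy only checks whether the top prediction is right, while perplexity also depends on how much probability the correct word receives. The 4-word vocabulary and the probabilities are invented for illustration, and `accuracy_and_perplexity` is a hypothetical helper.

```python
import math


def accuracy_and_perplexity(probs, targets):
    """Return (accuracy, perplexity) for predicted distributions and target ids."""
    correct = 0
    total_nll = 0.0
    for dist, target in zip(probs, targets):
        predicted = max(range(len(dist)), key=dist.__getitem__)
        correct += predicted == target        # top-1 hit or miss
        total_nll += -math.log(dist[target])  # cross-entropy in nats
    n = len(targets)
    return correct / n, math.exp(total_nll / n)


targets = [0, 2, 1]

# Two hypothetical models making the same top-1 predictions (0, 2, 0)
confident = [
    [0.90, 0.05, 0.03, 0.02],
    [0.05, 0.10, 0.80, 0.05],
    [0.60, 0.30, 0.05, 0.05],
]
hedging = [
    [0.4, 0.3, 0.2, 0.1],
    [0.2, 0.3, 0.4, 0.1],
    [0.4, 0.3, 0.2, 0.1],
]

acc_a, ppl_a = accuracy_and_perplexity(confident, targets)
acc_b, ppl_b = accuracy_and_perplexity(hedging, targets)
```

Both models have the same accuracy here, but the first one has a lower perplexity because it puts more probability on the correct words.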

Also, thanks for the script! An Adam-W implementation was much needed.