Super convergence(ish) on wikitext-2

Ah, I see. So accuracy and loss are still a good proxy for how the language model is doing during training, but perplexity is what matters when reporting concrete results.
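
For anyone else reading along, here is a minimal sketch of how the two relate, assuming the loss is the mean per-token cross-entropy in nats (natural log), which is what PyTorch's `CrossEntropyLoss` gives by default. Under that assumption perplexity is just the exponential of the loss. The helper names below are my own, not from the script in this thread.

```python
import math

def mean_nll_from_probs(probs):
    """Mean negative log-likelihood (nats) from the probabilities
    the model assigned to each correct token."""
    return -sum(math.log(p) for p in probs) / len(probs)

def perplexity(mean_nll):
    """Perplexity from a mean per-token cross-entropy given in nats."""
    return math.exp(mean_nll)

# Toy case: the model gives every correct token probability 0.25,
# i.e. it is as uncertain as a uniform guess over 4 tokens.
probs = [0.25, 0.25, 0.25, 0.25]
loss = mean_nll_from_probs(probs)
ppl = perplexity(loss)
print(f"loss={loss:.4f}  perplexity={ppl:.4f}")  # perplexity is roughly 4.0
```

Because the mapping is monotonic, a lower validation loss always means a lower perplexity, so tracking loss during training and converting at the end for comparison with published numbers should be safe.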

Also, thanks for the script! An AdamW implementation was much needed.