I’ve been doing some language modeling for a research project and wanted to share some progress I’ve made that I find pretty exciting. One aspect of my project involved implementing a decoder-only Transformer language model, and I wanted to compare it to the AWD LSTM fastai language model discussed in Jeremy and Sylvain’s recent post about AdamW and super-convergence. I used super-convergence (the one-cycle policy) to train my model, and spent only about an afternoon tuning it with the fastai framework (which I have fallen hopelessly in love with at this point). As you can see in the table below, I achieved far better results in much less time, partly because I needed fewer epochs and partly because Transformers replace recurrence with self-attention, so training parallelizes across the whole sequence instead of stepping through it token by token.
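For anyone curious what a decoder-only Transformer language model looks like in this setup, here’s a minimal PyTorch sketch. To be clear, this is not my actual research model: the dimensions, layer counts, and the fastai names in the trailing comment are illustrative assumptions. It stacks self-attention blocks behind a causal mask so each token only attends to earlier positions, and the commented lines show how such a model could be dropped into a fastai `Learner` and trained with `fit_one_cycle`.

```python
import torch
import torch.nn as nn

class DecoderOnlyLM(nn.Module):
    "Tiny decoder-only Transformer LM: token + position embeddings, causal self-attention blocks, LM head."
    def __init__(self, vocab_sz, d_model=256, n_head=4, n_layers=4, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_sz, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_head,
                                           dim_feedforward=4 * d_model,
                                           dropout=0.1, batch_first=True)
        # With a causal mask, a stack of "encoder" layers behaves as a decoder-only model.
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.head = nn.Linear(d_model, vocab_sz)

    def forward(self, x):                      # x: (batch, seq_len) token ids
        seq_len = x.size(1)
        pos = torch.arange(seq_len, device=x.device)
        h = self.tok_emb(x) + self.pos_emb(pos)
        # Additive causal mask: -inf above the diagonal so tokens can't attend to the future.
        mask = torch.full((seq_len, seq_len), float("-inf"), device=x.device).triu(1)
        h = self.blocks(h, mask=mask)
        return self.head(h)                    # (batch, seq_len, vocab_sz) logits

# Smoke test on random token ids.
model = DecoderOnlyLM(vocab_sz=1000)
logits = model(torch.randint(0, 1000, (2, 128)))
print(logits.shape)                            # torch.Size([2, 128, 1000])

# With fastai (names assumed, v2-style API), super-convergence training is just
# the one-cycle policy on a Learner wrapping this model:
#   learn = Learner(dls, model, loss_func=CrossEntropyLossFlat())
#   learn.fit_one_cycle(5, lr_max=1e-3)
```

I used `nn.TransformerEncoderLayer` rather than `nn.TransformerDecoderLayer` in the sketch because PyTorch’s decoder layer expects an encoder memory for cross-attention; with a causal mask, the encoder-style block is exactly the self-attention-only decoder block a language model needs.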
I’ve been super impressed by all the work from Jeremy and the rest of the team. Using fastai for this project has vastly improved my ability to iterate on and improve my models. Keep up the great work, guys!