Fastai Transformer and TransformerXL models

@Esteban - not sure you are still interested in this, as it has been several months. I'm still learning too, but the Transformer architecture seems to fit most closely into the transfer-learning approach of ULMFiT, since a pretrained model has been made available for it. TransformerXL (as of now, at least) does NOT have a pretrained model, so you have to put in a lot more work and GPU time to reach a comparably good starting point.
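For reference, here's roughly what that difference looks like in the fastai v1 text API. This is just a sketch, not something I've run as-is - it assumes you've already built a `TextLMDataBunch` named `data_lm`, and the `drop_mult` value is an arbitrary example:

```python
from fastai.text import language_model_learner, Transformer, TransformerXL

# Transformer ships with pretrained weights, so the usual ULMFiT-style
# transfer learning works: the learner downloads and loads them by default.
learn = language_model_learner(data_lm, Transformer, drop_mult=0.3)

# TransformerXL has no pretrained model (yet), so you have to pass
# pretrained=False and train the language model from scratch before
# fine-tuning on your task - that's where the extra GPU time goes.
learn_xl = language_model_learner(data_lm, TransformerXL, pretrained=False)
```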

These other posts have quite a few comments and suggestions on how to use the Transformer* architectures. From what I've experienced and what others have posted, they require different hyperparameters and behave quite differently (training time, how accuracy changes over time, etc.).