Google AI has released a new architecture for transfer learning in NLP called BERT: https://www.reddit.com/r/MachineLearning/comments/9nfqxz/r_bert_pretraining_of_deep_bidirectional/
I think it is worth checking out, as it boasts SOTA results on a range of tasks with minimal task-specific architecture changes. On the other hand, the models are huge - 110M and 340M parameters for the base and large models respectively. Still, once Google releases the pretrained models, this language-modelling-based transfer learning approach could really take off.
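To make the "minimal architecture change" point concrete, here's a rough PyTorch sketch of the fine-tuning recipe the paper describes: keep the pretrained encoder as-is and add just one linear classification layer on top of the first-token ([CLS]) representation, then fine-tune the whole thing end to end. The class name and the stand-in encoder below are my own placeholders, since the actual models and code aren't out yet.

    import torch
    import torch.nn as nn

    class BertForClassification(nn.Module):
        """Pretrained encoder + one new linear layer -- that's the whole change."""
        def __init__(self, encoder, hidden_size=768, num_labels=2):
            super().__init__()
            self.encoder = encoder                                # pretrained weights, reused as-is
            self.classifier = nn.Linear(hidden_size, num_labels)  # the only task-specific layer

        def forward(self, embeddings):
            hidden_states = self.encoder(embeddings)  # (batch, seq_len, hidden)
            cls_vector = hidden_states[:, 0]          # first-token ([CLS]) summary vector
            return self.classifier(cls_vector)        # task logits

    # Stand-in encoder just to make the sketch runnable; the real thing is a
    # deep bidirectional Transformer we don't have access to yet.
    encoder = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))
    model = BertForClassification(encoder)
    logits = model(torch.randn(4, 128, 768))  # (batch=4, seq_len=128, hidden=768)
    print(logits.shape)                       # torch.Size([4, 2])

The appealing part is that the same pretrained encoder should work for many downstream tasks, with only that last layer swapped out per task.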
What do you guys think?