I can't understand the difference between ULMFiT and GPT's training recipe. Both seem to be trained autoregressively, i.e., by predicting the next word in a sequence, so I was wondering whether there are any major differences? Of course GPT uses a Transformer, but I'm asking specifically about the training procedure, not the network architecture.
Yeah, the pretraining objective is essentially the same. GPT's main contribution was using a Transformer decoder instead of ULMFiT's AWD-LSTM, but you wouldn't be wrong to say that ULMFiT paved the way for the pretraining + fine-tuning approach for language models. The fine-tuning steps do differ: ULMFiT first fine-tunes the LM on target-domain text with tricks like discriminative learning rates and gradual unfreezing, then trains a classifier head, while GPT fine-tunes directly on the downstream task using task-specific input transformations.
[Screenshot from the GPT Paper]
They also used a different corpus for the pretraining bit: GPT used BooksCorpus, while ULMFiT used WikiText-103.
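To make the "same objective" point concrete, here's a toy sketch (not code from either paper) of the autoregressive loss both recipes minimize, L = -Σₜ log p(wₜ | w₁..wₜ₋₁). A bigram count model stands in for the network, since the objective doesn't care whether p comes from an AWD-LSTM or a Transformer:

```python
import math
from collections import Counter, defaultdict

def train_bigram_lm(corpus):
    """Estimate p(next | prev) from bigram counts (a stand-in for the network)."""
    counts = defaultdict(Counter)
    for sent in corpus:
        for prev, nxt in zip(sent, sent[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for prev, nxt in counts.items()
    }

def nll(model, sent):
    """Autoregressive loss: sum over positions of -log p(w_t | w_{t-1})."""
    return sum(-math.log(model[p].get(n, 1e-9)) for p, n in zip(sent, sent[1:]))

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
lm = train_bigram_lm(corpus)
print(lm["the"]["cat"])                  # → 1.0 ("cat" always follows "the")
print(round(nll(lm, ["the", "cat", "sat"]), 3))  # → 0.693, i.e. -log(0.5)
```

The part that differs between the two recipes is everything around this loss: the model producing the probabilities, the corpus, and how fine-tuning is done afterwards.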
Ah that makes sense, thank you!