I can't work out the difference between ULMFiT's and GPT's training recipes. Both seem to predict the next word in a sequence autoregressively, so are there any major differences? Of course, GPT uses a Transformer, but I'm asking specifically about the training procedure, not the network architecture.
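For context, the shared objective both models optimize during pretraining is next-token prediction: given tokens 1..t, predict token t+1. A toy sketch (names and the example sentence are mine, just for illustration) of how the (input, target) training pairs are formed:

```python
# Both ULMFiT and GPT pretrain as autoregressive language models:
# predict token t+1 from tokens 1..t.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Inputs are all tokens except the last; targets are the same
# sequence shifted left by one position.
inputs = tokens[:-1]
targets = tokens[1:]

pairs = list(zip(inputs, targets))
# First pair: input "the", target "cat".
```

The model sees each prefix and is trained (via cross-entropy over the vocabulary) to assign high probability to the actual next token; this step is identical in spirit for both recipes.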


Yeah, GPT's main contribution was using a Transformer instead of an LSTM, but you wouldn't be wrong to say that ULMFiT paved the way for the pretraining + fine-tuning approach for language models. :smile:

[Screenshot from the GPT Paper]

They also used a different corpus for pretraining: GPT used BooksCorpus, while ULMFiT used WikiText-103.


Ah, that makes sense, thank you!
