Transfer Learning and Size of Unlabeled Domain Corpus

Hey community members!

We finished a small but useful extension of the original ULMFiT paper, exploring how varying the amount of available unlabeled domain data affects accuracy on the language task. The headline takeaways from the work are:

  • 75% of the accuracy boost found in the original paper can be achieved with about a third of the unlabeled data.
  • It confirms the intuition that using any domain data to make a ULM domain specific is better than the ULM on its own, and that ULM + Domain is always better than a Domain-only model.
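
For anyone wondering how a figure like "75% of the boost" can be computed, here is a minimal sketch. The function name `boost_fraction` and the three accuracy values are placeholders made up for illustration, not numbers from the blog post; the idea is just to divide the gain from the partial corpus by the gain from the full corpus.

```python
def boost_fraction(baseline_acc, partial_acc, full_acc):
    """Fraction of the full accuracy boost recovered with partial domain data.

    baseline_acc: accuracy without any unlabeled domain data
    partial_acc:  accuracy with a subset of the unlabeled domain data
    full_acc:     accuracy with all of the unlabeled domain data
    """
    full_boost = full_acc - baseline_acc
    if full_boost <= 0:
        raise ValueError("full_acc must be greater than baseline_acc")
    return (partial_acc - baseline_acc) / full_boost


# Hypothetical placeholder accuracies (NOT results from the blog post).
baseline = 0.80
partial = 0.86   # e.g. trained with roughly a third of the unlabeled data
full = 0.88

print(f"Boost recovered: {boost_fraction(baseline, partial, full):.0%}")
```

Swap in the accuracies from your own runs to see how much of the boost a smaller corpus gets you.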

Check out the blog post and linked repo (based on v1 & PyTorch v1). Special thanks to @sebastianruder and @jeremy for their great initial work and enabling library!


Hi Jesse, thanks for sharing! Nice work! :smiley:

Jesse, this is a great blog post and data set. From your graphs, I drew some additional conclusions:

  • As soon as you add your own unlabeled data (equivalent to the first training epoch after “unfreeze”), you get a huge improvement (double?) in accuracy over the ULMFiT model alone.

  • It appears that if you have enough unlabeled data and you keep training long enough, the accuracy of a model trained on your data alone will catch up to the accuracy of ULMFiT + your unlabeled data. I guess this is just common sense; the big value of ULMFiT is when you don’t have a ton of GPU time or unlabeled data.