Hey fast.ai community members!
We finished a small but useful extension of the original ULMFiT paper, exploring how the amount of available unlabeled domain data affects accuracy on the target task. The headline takeaways:
- About 75% of the accuracy boost reported in the original paper can be achieved with roughly a third of the unlabeled domain data
- It confirms the intuition that fine-tuning the ULM on any amount of domain data beats the ULM on its own, and that ULM + domain fine-tuning is always better than a domain-only model
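
To make the first takeaway concrete, here is a tiny sketch of how "fraction of the full boost" is computed. The accuracy numbers below are made up purely for illustration and are not from the paper:

```python
def boost_fraction(base_acc, partial_acc, full_acc):
    """Fraction of the full ULM -> ULM + domain accuracy boost
    captured by a model fine-tuned on only part of the domain data."""
    return (partial_acc - base_acc) / (full_acc - base_acc)

# Hypothetical accuracies, for illustration only:
base = 0.80   # ULM only, no domain fine-tuning
full = 0.88   # ULM fine-tuned on all unlabeled domain data
third = 0.86  # ULM fine-tuned on ~1/3 of the domain data

print(round(boost_fraction(base, third, full), 2))  # → 0.75
```

So "75% of the boost with a third of the data" means the partially fine-tuned model closes three quarters of the gap between the plain ULM and the fully fine-tuned one.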
Check out the blog post and the linked repo (built on fast.ai v1 and PyTorch v1). Special thanks to @sebastianruder and @jeremy for their great initial work and the enabling library!