I know … the NLP hypetrain has moved on to HuggingFace and transformers - but I still love ULMFiT.
fastai and ULMFiT are still great - easy to use, fast to train and imho still SOTA for classification. But what was missing (esp. for fastai2 with SentencePiece) were easy-to-use pretraining scripts and pretrained models. That's what I am trying to fix with this repo.
About 4 months ago I published a repository to pretrain a German ULMFiT with SentencePiece from scratch - based on information on the forums and the nlp-course notebooks for Turkish and Vietnamese. The repo didn’t contain any documentation and I got a couple of questions and requests which made me think that there is still some interest in ULMFiT.
So despite the (100% legitimate) huggingface-hype I put some time and effort into improving the repo and the pretraining script, and I also trained some ULMFiT models (links in the repo).
All models use SentencePiece with a vocab size of 15,000 (this seems small, but larger vocabularies gave no improvement on the downstream tasks for the datasets I evaluated the models on), and each comes with a forward and a backward model.
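For anyone unsure what a forward and a backward model means here, this is a minimal sketch of the idea, not the code from the repo: a forward language model predicts the next token, while a backward model is trained on the same text with the token order reversed. The function name `lm_pairs` and the token ids are made up for illustration.

```python
from typing import List, Tuple


def lm_pairs(token_ids: List[int], backwards: bool = False) -> Tuple[List[int], List[int]]:
    """Build (inputs, targets) for next-token prediction.

    Hypothetical helper for illustration only. With backwards=True the
    sequence is reversed first, so the model learns to predict the
    preceding token of the original text.
    """
    seq = token_ids[::-1] if backwards else list(token_ids)
    # Targets are the inputs shifted by one position.
    return seq[:-1], seq[1:]


# Made-up SentencePiece ids for a four-token sentence.
ids = [5, 17, 42, 8]

fwd_x, fwd_y = lm_pairs(ids)
bwd_x, bwd_y = lm_pairs(ids, backwards=True)

print("forward :", fwd_x, "->", fwd_y)
print("backward:", bwd_x, "->", bwd_y)
```

Both directions share the same tokenizer and vocabulary; only the order of the tokens fed to the model differs.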
Pretraining requires about 4 hours (1 hour for the Wikipedia download and preparation + 3 hours of training on an RTX 3090) and less than 8 GB of GPU RAM with batch size 128.
I validated most of the models on classification datasets and tried to compare the results with those reported in BERT papers, without a lot of hyperparameter tuning - sometimes my models were a bit better than BERT, sometimes they performed slightly worse. I am sure there is room for improvement - if you find better hyperparameters or think other vocab sizes work better for some languages please let me know - but I believe the models are a good starting point.
To improve the usability of the pretrained models I created a small library with some helper functions to create the tokenizer and learner. See this notebook (Colab) for how to use the models:
If you have any questions or comments please let me know and I hope this is still helpful :).