I know … the NLP hype train has moved on to HuggingFace and transformers - but I still love ULMFiT.
fastai and ULMFiT are still great - easy to use, fast to train and imho still SOTA for classification. But what was missing (esp. for fastai2 with SentencePiece) were easy-to-use pretraining scripts and pretrained models. That’s what I am trying to fix with this repo.
About 4 months ago I published a repository to pretrain a German ULMFiT model with SentencePiece from scratch - based on information from the forums and the nlp-course notebooks for Turkish and Vietnamese. The repo didn’t contain any documentation, and I got a couple of questions and requests which made me think that there is still some interest in ULMFiT.
So despite the (100% legitimate) huggingface-hype I put some time and effort into improving the repo and the pretraining script, and also trained some ULMFiT models (links in the repo).
German
Dutch
Russian
Portuguese
Vietnamese
Japanese
Italian
Spanish
Korean
Thai
Hebrew
Arabic
All models use SentencePiece with a vocab size of 15,000 (this seems small, but a larger vocab brought no improvement on the downstream tasks for the datasets I evaluated the models on), and each language has a forward and a backward model.
Pretraining takes about 4 hours (1 hour Wikipedia download and preparation + 3 hours training on an RTX 3090) and needs less than 8 GB of GPU RAM with batch size 128.
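For anyone who wants a feel for the setup without digging through the repo, here is a minimal fastai v2 sketch of this kind of pretraining run (SentencePiece with a 15k vocab, batch size 128, AWD-LSTM language model trained from scratch). The folder name, epoch count and learning rate are placeholders - the pretraining script in the repo is the actual reference:

```python
from fastai.text.all import *

# Assumed layout: one prepared Wikipedia article per .txt file under wiki_de/
path = Path('wiki_de')

# SentencePiece tokenizer trained on the corpus, vocab size 15k (as above)
tok = SentencePieceTokenizer(lang='de', vocab_sz=15000)

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True, tok=tok, backwards=False),
    get_items=get_text_files,
    splitter=RandomSplitter(valid_pct=0.1, seed=42),
).dataloaders(path, path=path, bs=128, seq_len=72)

# AWD-LSTM language model, trained from scratch (pretrained=False)
learn = language_model_learner(
    dls_lm, AWD_LSTM, pretrained=False,
    metrics=[accuracy, Perplexity()],
).to_fp16()

learn.fit_one_cycle(10, 2e-3)   # epoch count and lr are placeholders
learn.save('lm_de_fwd')         # weights end up in wiki_de/models/
# For the backward model, repeat the whole thing with backwards=True.
```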
I validated most of the models on classification datasets and tried to compare the results with BERT papers, without a lot of hyperparameter tuning - sometimes my models were a bit better than BERT, sometimes they performed slightly worse. I am sure there is room for improvement - if you find better hyperparameters or think other vocab sizes work better for some languages, please let me know - but I believe the models are a good starting point.
To improve the usability of the pretrained models I created a small library with some helper functions to create the tokenizer and learner. See this notebook (Colab) for how to use the models:
If you have any questions or comments please let me know and I hope this is still helpful :).
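For readers who just want the general shape of the fine-tuning workflow, here is a rough plain-fastai sketch. It does not use the helper library mentioned above, and the file names for the SentencePiece model, the LM weights and the vocab are placeholders - the downloaded model archives and the notebook define the real ones:

```python
from fastai.text.all import *

# Assumed contents of the downloaded model folder (names are placeholders):
#   ulmfit-de/spm.model            SentencePiece model
#   ulmfit-de/models/lm_wgts.pth   pretrained LM weights
#   ulmfit-de/models/lm_vocab.pkl  matching vocab
path = Path('ulmfit-de')
tok  = SentencePieceTokenizer(sp_model=path/'spm.model')

df = pd.read_csv('my_classification_data.csv')   # 'text' and 'label' columns

# 1) Fine-tune the pretrained language model on the target corpus
dls_lm = TextDataLoaders.from_df(df, path=path, text_col='text', is_lm=True,
                                 valid_pct=0.1, tok_tfm=tok, bs=128)
lm_learn = language_model_learner(
    dls_lm, AWD_LSTM, pretrained=True,
    pretrained_fnames=['lm_wgts', 'lm_vocab'],   # looked up in path/'models'
    metrics=[accuracy, Perplexity()]).to_fp16()
lm_learn.fine_tune(3, 2e-3)
lm_learn.save_encoder('finetuned_enc')

# 2) Train the classifier on top of the fine-tuned encoder
dls_clas = TextDataLoaders.from_df(df, path=path, text_col='text', label_col='label',
                                   valid_pct=0.1, tok_tfm=tok,
                                   text_vocab=dls_lm.vocab, bs=64)
clas_learn = text_classifier_learner(dls_clas, AWD_LSTM, pretrained=False,
                                     metrics=accuracy).to_fp16()
clas_learn = clas_learn.load_encoder('finetuned_enc')
clas_learn.fine_tune(4, 2e-2)
```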
I’ve always found it troubling that when BERT came out, nobody did a head-to-head comparison with ULMFiT. BERT has far more parameters and training data than ULMFiT, and consumes far more compute to pre-train. I wish someone had scaled up ULMFiT pretraining so it would be a fair comparison. Sadly, we mostly can’t do that experiment because we don’t have the compute resources of Google. I built a BERT from scratch on Colab, but I had a much smaller pre-training dataset than Google’s BERT. I wish Google would donate the compute to run that experiment. Has anyone approached them?
IIRC overall ULMFiT is still better in generalization scenarios with classification (we still use ULMFiT over BERT at work), though BERT is useful in certain situations
Awesome initiative Florian! I 100% share your opinion about ULMFiT
Wow, that’s really fast. Is that the case for all languages you’ve trained on? I am wondering because I thought the Wikipedia dataset sizes vary quite a bit across languages.
If they are comparable on typical performance measures, then why didn’t either Smerity or Sebastian Ruder defend it? Everyone seems to say that transformers have more context memory over longer strings of tokens, but with the computational cost of attention (proportional to the square of the number of tokens), I see BERT being weak in that regard. In any event, glad to hear you are using ULMFiT.
No, the training times differed between the languages, and it was not always easy to count the right number of “words” (for example for Japanese or Korean). I tried to adjust the number of articles and the minimum article length so that I got close to 18 minutes of training time per epoch (matching the 160k German articles for which I had verified I get good results). For some languages there were not enough articles in the dump, so the datasets were smaller and the training times faster. Also, I had the impression that articles that are too short harm performance … so there’s a bit of a tradeoff.
In the end I think all the models I trained work quite well, but I didn’t train on “tiny” wikipedias.
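To make that article-selection step a bit more concrete, here is a rough illustration of the kind of filtering described above. The directory name, word threshold and file names are invented, and whitespace splitting is only a crude word count that, as noted, breaks down for languages like Japanese or Korean:

```python
from pathlib import Path

MIN_WORDS    = 400      # minimum article length (placeholder value)
MAX_ARTICLES = 160_000  # roughly the German setup mentioned above

kept = 0
with open('corpus_de.txt', 'w', encoding='utf-8') as corpus:
    for f in sorted(Path('wiki_extracted').glob('*.txt')):
        text = f.read_text(encoding='utf-8')
        if len(text.split()) < MIN_WORDS:   # crude whitespace word count
            continue
        corpus.write(text + '\n')
        kept += 1
        if kept >= MAX_ARTICLES:
            break
print(f'kept {kept} articles')
```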
I’ve searched around a bit but might have missed something - does anyone know of any pretrained SentencePiece LMs for English? If not, @florianl, is this something you’d want a pull request to add to your repository?