ULMFiT pretraining + pretrained models

I know … the NLP hypetrain has moved on to HuggingFace and transformers - but I still love ULMFiT.

fastai and ULMFiT are still great - easy to use, fast to train and imho still SOTA for classification. But what was missing (esp. for fastai2 with SentencePiece) were easy-to-use pretraining scripts and pretrained models. That's what I am trying to fix with this repo :slight_smile:.

About 4 months ago I published a repository to pretrain a German ULMFiT with SentencePiece from scratch - based on information from the forums and the nlp-course notebooks for Turkish and Vietnamese. The repo didn't contain any documentation, and I got a couple of questions and requests, which made me think that there is still some interest in ULMFiT.

So despite the (100% legitimate) HuggingFace hype, I put some time and effort into improving the repo and the pretraining script, and also trained ULMFiT models for the following languages (links in the repo):

  • German
  • Dutch
  • Russian
  • Portuguese
  • Vietnamese
  • Japanese
  • Italian
  • Spanish
  • Korean
  • Thai
  • Hebrew
  • Arabic

All models use SentencePiece with a vocab size of 15,000 (this seems small, but larger vocabs gave no improvement on the downstream tasks for the datasets I evaluated the models on), and each language comes with a forward and a backward model.

Pretraining takes about 4 hours (1 hour for the Wikipedia download and preparation + 3 hours of training on an RTX 3090) and less than 8 GB of GPU RAM with batch size 128.
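The repo contains the actual pretraining script; just to give an idea of what it does, here is a minimal plain-fastai2 sketch of pretraining a SentencePiece language model from scratch. The dataframe `wiki_df`, the `text` column and the hyperparameters are placeholders, not the exact values from the repo:

```python
from fastai.text.all import *

# wiki_df: a DataFrame with one Wikipedia article per row in a 'text' column
# (placeholder - the repo builds this from the Wikipedia dump)
tok = SentencePieceTokenizer(lang='de', vocab_sz=15000)  # train a 15k-token SentencePiece model

dls = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True, tok=tok),  # backwards=True for the backward model
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.1),
).dataloaders(wiki_df, bs=128, seq_len=72)

learn = language_model_learner(
    dls, AWD_LSTM, pretrained=False,                 # train from scratch
    drop_mult=0.5, metrics=[accuracy, Perplexity()]
).to_fp16()

learn.fit_one_cycle(10, 3e-3)                        # epochs / lr are placeholders
learn.save('lm_de_fwd')                              # full LM weights
learn.save_encoder('lm_de_fwd_enc')                  # encoder for downstream fine-tuning
```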

I validated most of the models on classification datasets and tried to compare the results with BERT papers, without a lot of hyperparameter tuning - sometimes my models were a bit better than BERT, sometimes they performed slightly worse. I am sure there is room for improvement - if you find better hyperparameters or think other vocab sizes work better for some languages, please let me know - but I believe the models are a good starting point.

To improve the usability of the pretrained models, I created a small library with some helper functions to create the tokenizer and learner. See this notebook (Colab) for how to use the models.
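The notebook shows the actual helper-library API; as a rough idea of the fine-tuning flow, here is a plain-fastai2 sketch assuming you downloaded a SentencePiece model, the LM weights (.pth) and the vocab (.pkl) from the repo - all file names and paths below are placeholders, and `df` is your own labelled dataset with `text` and `label` columns:

```python
from fastai.text.all import *

spm_path  = 'models/de/spm.model'                        # placeholder path to the SentencePiece model
lm_fnames = ['de_wikipedia_lm', 'de_wikipedia_vocab']    # placeholder .pth / .pkl names in ./models

tok = SentencePieceTokenizer(lang='de', sp_model=spm_path)

# Fine-tune the pretrained language model on the target corpus
dls_lm = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True, tok=tok),
    get_x=ColReader('text'), splitter=RandomSplitter(0.1),
).dataloaders(df, bs=128, seq_len=72)

learn_lm = language_model_learner(dls_lm, AWD_LSTM, pretrained=True,
                                  pretrained_fnames=lm_fnames, drop_mult=0.5)
learn_lm.fine_tune(5, 3e-3)
learn_lm.save_encoder('finetuned_enc')

# Train the classifier on top of the fine-tuned encoder
dls_clas = DataBlock(
    blocks=(TextBlock.from_df('text', tok=tok, vocab=dls_lm.vocab), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('label'), splitter=RandomSplitter(0.1),
).dataloaders(df, bs=64)

learn = text_classifier_learner(dls_clas, AWD_LSTM, pretrained=False, metrics=accuracy)
learn.load_encoder('finetuned_enc')
learn.fine_tune(5, 2e-2)
```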

If you have any questions or comments, please let me know - I hope this is still helpful :).

Florian


I’ve always found it troubling that when BERT came out, nobody did a head-to-head comparison with ULMFiT. BERT has far more parameters and training data than ULMFiT, and consumes far more compute to pretrain. I wish someone had scaled up ULMFiT pretraining so it would be a fair comparison. Sadly we mostly can’t do that experiment because we don’t have the compute resources of Google. I built a BERT from scratch on Colab, but I had a much smaller pretraining dataset than Google BERT. I wish Google would donate the compute to run that experiment. Has anyone approached them?

They actually have :slight_smile:

https://towardsdatascience.com/battle-of-the-heavyweights-bert-vs-ulmfit-faceoff-91a582a7c42b

IIRC, overall ULMFiT still generalizes better on classification tasks (we still use ULMFiT over BERT at work), though BERT is useful in certain situations.

Awesome initiative Florian! I 100% share your opinion about ULMFiT :slight_smile:

Wow, that’s really fast. Is that the case for all languages you’ve trained on? I am wondering because I thought the Wikipedia dataset sizes vary quite a bit across languages.

If they are comparable on typical performance measures, then why didn’t either Smerity or Sebastian Ruder defend it? Everyone seems to say that transformers can use more context over longer strings of tokens, but given the computational cost of attention (proportional to the square of the number of tokens), I see BERT being weak in that regard. In any event, glad to hear you are using ULMFiT.

Wow Florian, thank you a lot.

All other comments in this topic have been also amazing.

Deploying BERT in production is not straightforward. I prefer MultiFiT - it is also a lot faster to train.