ULMFiT pretraining + pretrained models

I know … the NLP hype train has moved on to Hugging Face and transformers - but I still love ULMFiT.

fastai and ULMFiT are still great - easy to use, fast to train and imho still SOTA for classification. But what was missing (esp. for fastai2 with SentencePiece) were easy-to-use pretraining scripts and pretrained models. That's what I am trying to fix with this repo :slight_smile:.

About 4 months ago I published a repository to pretrain a German ULMFiT with SentencePiece from scratch - based on information on the forums and the nlp-course notebooks for Turkish and Vietnamese. The repo didn’t contain any documentation and I got a couple of questions and requests which made me think that there is still some interest in ULMFiT.

So despite the (100% legitimate) Hugging Face hype I put some time and effort into improving the repo and the pretraining script, and also trained some ULMFiT models (links in the repo).

  • German
  • Dutch
  • Russian
  • Portuguese
  • Vietnamese
  • Japanese
  • Italian
  • Spanish
  • Korean
  • Thai
  • Hebrew
  • Arabic

All models use SentencePiece with a vocab size of 15,000 (this may seem small, but larger vocabs didn't improve the downstream tasks on the datasets I evaluated), and each language comes with a forward and a backward model.
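For reference, here is a minimal sketch of how a SentencePiece model with that vocab size can be trained on a plain-text Wikipedia dump (the file names and the unigram model type are my assumptions for illustration; the repo's pretraining script handles this step for you):

import sentencepiece as spm

# Train a SentencePiece model on a plain-text Wikipedia corpus.
# 'wiki_corpus.txt' and the unigram model type are assumptions for
# illustration - the repo's pretraining script takes care of this.
spm.SentencePieceTrainer.train(
    input='wiki_corpus.txt',        # one article/sentence per line
    model_prefix='spm_15k',         # writes spm_15k.model / spm_15k.vocab
    vocab_size=15000,               # the vocab size used for all models
    model_type='unigram',           # SentencePiece default
    character_coverage=0.9999,      # helpful for languages with large scripts
)

# Load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file='spm_15k.model')
print(sp.encode('ULMFiT still works great.', out_type=str))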

Pretraining takes about 4 hours (1 hour for the Wikipedia download and preparation + 3 hours of training on an RTX 3090) and less than 8 GB of GPU RAM with a batch size of 128.

I validated most of the models on classification datasets and tried to compare the results with the BERT papers without a lot of hyperparameter tuning - sometimes my models were a bit better than BERT, sometimes they performed slightly worse. I am sure there is room for improvement - if you find better hyperparameters or think other vocab sizes work better for some languages, please let me know - but I believe the models are a good starting point.

To improve the usability of the pretrained models I created a small library with some helper functions to create the tokenizer and learner. See this notebook (Colab) for how to use the models:
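Until you get to the notebook, here is a rough sketch of the underlying fastai mechanics that the helper functions wrap; dls_lm, dls_clf and the file names below are placeholders, not necessarily the library's actual API:

from fastai.text.all import *

# Assumes dls_lm is a language-model DataLoaders built with the same
# SentencePiece tokenizer as the pretrained weights, and that the
# downloaded weights/vocab are saved as 'lm_model.pth' / 'lm_vocab.pkl'
# in the learner's model directory (file names are placeholders).
learn_lm = language_model_learner(
    dls_lm, AWD_LSTM,
    pretrained=True,
    pretrained_fnames=['lm_model', 'lm_vocab'],  # without extensions
    drop_mult=0.3,
)
learn_lm.fine_tune(2)            # fine-tune the LM on your corpus
learn_lm.save_encoder('ft_enc')  # keep the encoder for the classifier

# Classification: reuse the fine-tuned encoder.
learn_clf = text_classifier_learner(dls_clf, AWD_LSTM, drop_mult=0.5)
learn_clf.load_encoder('ft_enc')
learn_clf.fine_tune(4)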

If you have any questions or comments please let me know and I hope this is still helpful :).

Florian


I’ve always found it troubling that when BERT came out, nobody did a head-to-head comparison with ULMFiT. BERT has far more parameters and training data than ULMFiT, and consumes far more compute to pretrain. I wish someone had scaled up ULMFiT pretraining so it would be a fair comparison. Sadly we mostly can’t do that experiment because we don’t have the compute resources of Google. I built a BERT from scratch on Colab, but I had a much smaller pretraining dataset than Google’s BERT. I wish Google would donate the compute to run that experiment. Has anyone approached them?

They actually have :slight_smile:

https://towardsdatascience.com/battle-of-the-heavyweights-bert-vs-ulmfit-faceoff-91a582a7c42b

IIRC, overall ULMFiT still generalizes better for classification (we still use ULMFiT over BERT at work), though BERT is useful in certain situations.


Awesome initiative Florian! I 100% share your opinion about ULMFiT :slight_smile:

Wow, that’s really fast. Is that the case for all languages you’ve trained on? I am wondering because I thought the Wikipedia dataset sizes vary quite a bit across languages.

If they are comparable on typical performance measures, then why didn’t either Smerity or Sebastian Ruder defend it? Everyone seems to say that transformers can keep more context over longer strings of tokens, but with the computational cost of attention (proportional to the square of the number of tokens), I see BERT being weak in that regard. In any event, glad to hear you are using ULMFiT.
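To make that scaling argument concrete, here is a back-of-the-envelope sketch (the sequence lengths and model width are arbitrary illustration values):

# Rough cost estimate for one self-attention layer vs. one recurrent layer.
# Self-attention over n tokens of width d costs on the order of n^2 * d
# (building the attention matrix), while a recurrent layer costs about n * d^2.
def attention_cost(n, d):
    return n * n * d

def lstm_cost(n, d):
    return n * d * d

d = 400  # illustrative model width
for n in (128, 512, 2048):
    print(n, attention_cost(n, d) / lstm_cost(n, d))
# the ratio grows like n/d, so attention gets relatively more expensive
# as sequences get longer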

Wow Florian, thank you so much.

All the other comments in this topic have been amazing as well.

Deploying BERT in production is not straightforward. I prefer MultiFiT - it is also a lot faster to train.

No, the training times differed between languages, and it was not always easy to find the right number of “words” (for example for Japanese or Korean). I tried to adjust the number of articles and the minimum article length so that I got close to 18 minutes of training time per epoch (the reference being the 160k German articles, for which I verified that I got good results). For some languages there were not enough articles in the dump, so the datasets were smaller / training times shorter. I also had the impression that articles that are too short hurt performance … so there’s a bit of a trade-off.

In the end I think all the models I trained work quite well, but I didn’t train on “tiny” Wikipedias.
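For illustration, the article filtering described above boils down to something like the following sketch (the JSON-lines input format and the exact thresholds are assumptions; the actual logic lives in the repo's preparation script):

import json

# Keep only articles above a minimum length and cap the total count,
# so that one epoch lands near the targeted training time.
# 'articles.jsonl' with a 'text' field and the thresholds are assumptions.
MIN_CHARS = 1200       # drop very short articles - they seemed to hurt
MAX_ARTICLES = 160_000

kept = []
with open('articles.jsonl', encoding='utf-8') as f:
    for line in f:
        text = json.loads(line)['text']
        if len(text) >= MIN_CHARS:
            kept.append(text)
        if len(kept) >= MAX_ARTICLES:
            break

with open('wiki_corpus.txt', 'w', encoding='utf-8') as out:
    out.write('\n'.join(kept))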

I’ve searched around a bit but might have missed something - does anyone know of any pretrained SentencePiece LMs for English? If not, @florianl, is this something you’d want a pull request to add to your repository?

After a couple of requests I’ve added a callback for extracting sentence embeddings to the library.

See the following notebook (scroll to the end :wink:).

pip install fastai-ulmfit

# create DataLoaders (dls) and fine-tune the classifier as usual
learn.fine_tune(...)

from fastai_ulmfit.embeddings import SentenceEmbeddingCallback

# run inference with the callback to collect sentence embeddings
se = SentenceEmbeddingCallback(pool_mode='concat')
_ = learn.get_preds(cbs=[se])
# the embedding vectors, texts, targets and predictions are stored in the dict se.feat

# show a scatterplot of the embeddings, reduced to 2D with PCA
from sklearn.decomposition import PCA
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

feat = se.feat
pca = PCA(n_components=2)
pca.fit(feat['vec'])
coords = pca.transform(feat['vec'])
target_labels = [dls.vocab[1][l] for l in feat['y']]
pred_labels = [dls.vocab[1][l] for l in feat['pred']]
df_preds = pd.DataFrame({'x': coords[:, 0], 'y': coords[:, 1],
                         'text': feat['text'],
                         'target_labels': target_labels,
                         'pred_labels': pred_labels,
                         'color': feat['pred']})

plt.scatter(df_preds['x'].tolist(), df_preds['y'].tolist(), c=df_preds['color'].tolist(), cmap='viridis')
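
A typical next step with those vectors is similarity search; here is a minimal sketch, assuming se.feat['vec'] holds one embedding per document as in the snippet above:

from sklearn.neighbors import NearestNeighbors

# Find the 5 most similar documents to the first one, by cosine distance.
nn = NearestNeighbors(n_neighbors=5, metric='cosine').fit(feat['vec'])
dist, idx = nn.kneighbors(feat['vec'][:1])
for d, i in zip(dist[0], idx[0]):
    print(f'{d:.3f}  {feat["text"][i]}')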

If you have questions let me know.

Florian

2 Likes