Language Model Zoo 🦍

(Alexey) #354

Hello. I would like to ask whether there is a recommended way to fine-tune an LM on domain data. I’ve seen two ways:

  1. Unfreeze all and train
  2. Unfreeze gradually with the freeze_to() function

Or maybe I missed something. Any advice would be much appreciated.


(Kaspar Lund) #355

the “lesson3-imdb.ipynb” is a good example


(Michael) #356

I would first try different learning rates for each layer group (lower learning rates at the input stage and higher ones at the end).
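
A minimal sketch of what per-layer-group learning rates could look like, assuming the rule from the ULMFiT paper of dividing the rate by 2.6 for each group closer to the input. The function name `layer_group_lrs` and the example values are hypothetical, and this is plain Python rather than a fastai call:

```python
def layer_group_lrs(top_lr, n_groups, factor=2.6):
    """Return one learning rate per layer group, input-side group first.

    The last group (closest to the output) gets top_lr; each earlier group
    gets the rate of the following group divided by `factor`.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    return [top_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]


if __name__ == "__main__":
    # Hypothetical values: 4 layer groups, 1e-2 for the last group.
    for i, lr in enumerate(layer_group_lrs(1e-2, 4)):
        print(f"group {i}: lr = {lr:.2e}")
```

The factor and the top learning rate are starting points to tune, not fixed values.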


(Alexey) #357

Thanks. I will try two approaches and see whether they make any difference:

  1. Same as in lesson3-imdb.ipynb
  2. Unfreeze more gradually (with freeze_to)
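
For reference, a rough sketch of what such a gradual schedule means. It assumes that `freeze_to(n)` freezes the layer groups before index `n` and leaves the rest trainable (my understanding of fastai v1; please check the docs). The helper names are hypothetical and the code only simulates the flags, it does not touch a real model:

```python
def trainable_flags(n_groups, n):
    """Mimic freezing groups [:n] and leaving groups [n:] trainable."""
    groups = list(range(n_groups))
    trainable = set(groups[n:])
    return [g in trainable for g in groups]


def gradual_unfreeze_schedule(n_groups):
    """Yield (n, flags) per stage: last group only, then last two, and so on."""
    for k in range(1, n_groups + 1):
        n = -k
        yield n, trainable_flags(n_groups, n)


if __name__ == "__main__":
    # Hypothetical model with 4 layer groups; train for a while at each stage.
    for n, flags in gradual_unfreeze_schedule(4):
        print(f"freeze_to({n}): trainable groups = {flags}")
```

The last stage has every group trainable, which is the same state as unfreezing everything at once.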

(Alexey) #358

In case anyone is interested in the future: for Russian, fine-tuning the language model with the same methodology as in lesson3-imdb.ipynb has given the best results in all my experiments so far.

Another couple of questions:

  1. My intuition is that we can achieve better results if we fine-tune the language model on domain-specific data with more training examples. In your experiments, how big were the domain-specific corpora?
  2. Has anyone tried a max vocab of 100000 or more for the LM fine-tuning step?

Thanks in advance.


(Piotr Czapla) #359

On wikitext-103 the model trains in about 18h on a 1080 Ti.

100k is huge; it makes it hard for the model to learn useful relations between words. For Russian you may want to use SentencePiece with 25k tokens. It works really well for Polish (better than SentencePiece with 50k tokens, and way better than 100k tokens).
You may want to check our paper & presentation; there is an example that shows how the number of tokens influences the way a random sentence is split.
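
To give a feel for the effect, here is a toy illustration of how vocabulary size changes the split. It is a greedy longest-match splitter, not SentencePiece's actual algorithm, and the English vocabularies are made up for the example:

```python
def greedy_split(word, vocab):
    """Split `word` into the longest pieces found in `vocab`, left to right.

    Single characters are always allowed as a fallback, so the split never fails.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first and shrink until something matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces


# Hypothetical vocabularies: the larger one also stores the rare whole word.
small_vocab = {"un", "break", "able"}
large_vocab = small_vocab | {"unbreakable"}

if __name__ == "__main__":
    print(greedy_split("unbreakable", small_vocab))
    print(greedy_split("unbreakable", large_vocab))
```

With the small vocabulary the word is built from pieces that also occur in many other words, so each piece is seen often during training. With the large vocabulary the rare word becomes a single token that the model sees only a few times.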


(Kaspar Lund) #360

It looks like the English Wikipedia dump will be 25-27 million sentences once I have finished the script to remove “abnormal sentences”. From my measurements, one epoch will take 20 hours.


(Gaurav) #361

ULMFit for Punjabi

SOTA for Language Modeling and Classifier

New Dataset for Punjabi Text Classification Challenges:

Please open a GitHub issue!


(Gaurav) #362

I’ve also trained a language model and classifier for Hindi, achieving a perplexity of ~35 on a 20% validation set of 55k Hindi Wikipedia articles. I’m using fastai v1 and SentencePiece for tokenization. I would like to compare our models on the BBC News classification dataset. Would you mind sharing your score?
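
For anyone comparing numbers, a small sketch of how perplexity relates to the validation loss. It assumes the usual convention that perplexity is the exponential of the mean per-token cross-entropy in nats; the function names are hypothetical:

```python
import math


def perplexity(mean_nll):
    """Perplexity from the mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


def mean_nll_from_perplexity(ppl):
    """Inverse of `perplexity`: the mean per-token loss for a given perplexity."""
    return math.log(ppl)


if __name__ == "__main__":
    # A perplexity of about 35 corresponds to a validation loss of about 3.56.
    print(f"{mean_nll_from_perplexity(35.0):.4f}")
```

Since the loss is averaged per token, perplexities are only directly comparable between models that use the same tokenization.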


(Piotr Czapla) #363

@disisbig can you make a thread for your language and put it into the top entry? Re comparison: we are in the process of assembling the language models in one repository to ensure reproducibility. Do you want to contribute your LM and hyperparameters?


(Gaurav) #364

Thanks @piotr.czapla. I’ve created the threads for Hindi and Punjabi. I’ll soon raise a PR to contribute my models and hyperparameters to ulmfit-multilingual.


(benedikt herudek) #365

Folks, would anyone know whether one can use a language model (instead of word vecs) for sequence-to-sequence translation? I think Jeremy mentioned that in the previous deep learning part II, in lesson 11, where he demoed translation with word vecs.

Not sure I got this right and whether it’s possible; pointers welcome.




Hi! I have trained another model for the Russian language using Taiga corpus: ULMFiT - Russian


(Kaspar Lund) #367

The Transformer and TransformerXL can be used for that. See the paper “Attention Is All You Need”.



It is possible, but you need to define your own decoder on top of the hidden states returned by the language model.


(benedikt herudek) #369

Which paper? Could you share a link?


(Kaspar Lund) #370

(benedikt herudek) #371

Thx. So does this have a language model, or is it a good way to do translation without RNNs?


(Kaspar Lund) #372

The breakthrough in this paper is that it is not an RNN. RNNs take a long time to train and have issues with translating long sentences. I have been training RNNs where it took 15 hours to process 10 epochs on 2.5e8 tokens. The awd_lstm RNN in fastai is a very interesting model; it just requires a lot of patience to train.

The same perplexity/accuracy can be reached in about an hour using the TransformerXL that @sgugger implemented recently. It handles long sentences much more elegantly (attention mechanism) and can be parallelized.

In short: if you want to train language models for translation, classification, etc., you will do it faster and better using the TransformerXL model.
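
As a back-of-the-envelope check on the RNN figures above, here is a small sketch that turns "15 hours for 10 epochs on 2.5e8 tokens" into throughput. It assumes this means a 2.5e8-token corpus seen 10 times, which is only one reading of that sentence; the function names are hypothetical:

```python
def tokens_per_second(corpus_tokens, epochs, hours):
    """Average training throughput over the whole run."""
    total_tokens = corpus_tokens * epochs
    return total_tokens / (hours * 3600)


def hours_per_epoch(epochs, hours):
    """Average wall-clock time spent on one epoch."""
    return hours / epochs


if __name__ == "__main__":
    print(f"{tokens_per_second(2.5e8, 10, 15):.0f} tokens/s")
    print(f"{hours_per_epoch(10, 15)} h per epoch")
```

Under that reading the run averages roughly 46,000 tokens per second, or 1.5 hours per epoch. Total time is epochs times time per epoch, so a model that needs fewer epochs is only faster overall if its epochs are not proportionally slower.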



If you manage to do that, please tell me how. Those models are heavy and require a much longer time to train! Though it’s true that it takes fewer epochs to reach a ppl as low as the AWD-LSTM on WT103, it still takes more compute time.
