Language Model Zoo 🦍

piotr.czapla · January 5, 2019, 5:09pm

That is usually achieved by fine-tuning an LM trained on WikiText. Do you have a large enough corpus (100M+ tokens) to train an LM from scratch?

The old code won’t work with the new fastai, which has changed a lot. If you want to start from zero, try n-waves/multifit on GitHub: the code to reproduce the results from the paper "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning" https://arxiv.org/abs/1909.04761

jannen · January 14, 2019, 7:13am

What kind of total training times have people seen when training a full LM on Wikipedia data?

Or what training time could one expect for 450,000 articles: how many days on a 1080 Ti or a Tesla P100, for example?

Grigor · January 15, 2019, 9:51am

Hey. My dataset is a mixture of French and English, and I have a classification problem. Can you give me some advice on using ULMFiT? Should I train a new LM on a mix of French and English Wikipedia? Thanks

s.tsuruno · January 21, 2019, 10:48am

Hi everyone. I’ve applied ULMFiT to Japanese and started a thread. Let me know if you’re interested.

ademyanchuk · January 24, 2019, 7:35am

Hello. I would like to ask whether there is a recommended way to fine-tune an LM on domain data. I’ve seen two ways:

Unfreeze all and train
Unfreeze gradually with the freeze_to() function

Or maybe I missed something. Any advice would be much appreciated.

Kaspar · January 24, 2019, 7:38am

The “lesson3-imdb.ipynb” notebook is a good example.

MicPie · January 24, 2019, 5:28pm

I would first try different learning rates for each layer group (lower learning rates at the input stage and higher ones at the end).
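
A minimal sketch of such per-layer-group learning rates, in plain Python. The helper name `discriminative_lrs` and the choice of 4 layer groups are assumptions made for illustration; the divisor of 2.6 between adjacent groups is the value suggested in the ULMFiT paper.

```python
def discriminative_lrs(top_lr, n_groups, factor=2.6):
    """Return one learning rate per layer group, lowest for the input-side group.

    Hypothetical helper: each group gets the rate of the group above it
    divided by `factor` (2.6 is the value suggested in the ULMFiT paper).
    """
    return [top_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]


# Assumed setup: 4 layer groups, 1e-3 for the last (output-side) group.
lrs = discriminative_lrs(1e-3, 4)
for group, lr in enumerate(lrs):
    print(f"group {group}: lr = {lr:.2e}")
```

In fastai v1 a comparable spread is usually requested by passing a `slice(low, high)` as the learning rate to `fit_one_cycle`, so a helper like this is mainly useful for seeing what the individual rates look like.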

ademyanchuk · January 25, 2019, 2:50am

Thanks. I will try two approaches and see whether the results differ:

Same as in lesson3-imdb.ipynb
Unfreeze more gradually (with freeze_to)
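
The two schedules can be compared with a small plain-Python simulation. It assumes that `freeze_to(n)` freezes layer groups `[:n]` and leaves groups `[n:]` trainable, and that the notebook's LM schedule is "train the last group, then unfreeze everything". The group names and the helpers `trainable_groups` and `gradual_schedule` are made up for illustration.

```python
# Hypothetical layer groups of an LM, input side first.
GROUPS = ["embedding", "rnn_1", "rnn_2", "rnn_3", "decoder"]


def trainable_groups(groups, n):
    """Groups left trainable after freeze_to(n): groups[:n] frozen, groups[n:] trainable."""
    return groups[n:]


def gradual_schedule(groups):
    """Unfreeze one more group per stage, starting from the last group only."""
    stages = [trainable_groups(groups, -k) for k in range(1, len(groups))]
    return stages + [list(groups)]  # final stage: everything unfrozen


# Schedule A (assumed from the notebook): last group only, then unfreeze all.
schedule_a = [trainable_groups(GROUPS, -1), list(GROUPS)]
# Schedule B: unfreeze one group at a time.
schedule_b = gradual_schedule(GROUPS)

for stage, names in enumerate(schedule_b, start=1):
    print(f"stage {stage}: train {names}")
```

Schedule B has more stages, so it needs more training rounds in total; whether that pays off is an empirical question for the dataset at hand.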

ademyanchuk · January 27, 2019, 6:48am

In case anyone is interested in the future: for Russian, fine-tuning the language model with the same methodology as in lesson3-imdb.ipynb has achieved the best result in all my experiments so far.

Another couple of questions:

My intuition is that we can achieve a better result if we fine-tune the language model on domain-specific data with more training examples. In your experiments, how big were the domain-specific corpora?
Has anyone tried a max vocab of 100,000 or more for the LM fine-tuning step?

Thanks in advance.

piotr.czapla · January 27, 2019, 7:10pm

On WikiText-103 the model trains in about 18 h on a 1080 Ti.

100k is huge; it makes it hard for the model to learn useful relations between words. For Russian you may want to use SentencePiece with 25k tokens. It works really well for Polish (better than SentencePiece with 50k tokens, and way better than 100k tokens).
You may check our paper and presentation; there is an example that shows how the number of tokens influences the way a random sentence is split.
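
To illustrate how vocabulary size changes the way text is split, here is a toy sketch in plain Python. It is a greedy longest-match segmenter, not SentencePiece's actual algorithm, and both vocabularies are invented for the example.

```python
def segment(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    Toy illustration only: single characters are always allowed as a
    fallback, so every word can be segmented.
    """
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


# Invented vocabularies: the larger one also stores the whole rare word.
small_vocab = {"un", "break", "able"}
large_vocab = small_vocab | {"unbreakable"}

print(segment("unbreakable", small_vocab))
print(segment("unbreakable", large_vocab))
```

With the small vocabulary the word is built from pieces that also occur in many other words; with the large one it becomes a single rare token that the model sees far less often during training.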

Kaspar · January 27, 2019, 10:18pm

It looks like the English Wikipedia dump will come to 25-27 million sentences once I have finished the script that removes “abnormal sentences”. From my measurements, one epoch will take 20 hours.

disisbig · January 31, 2019, 4:14am

ULMFiT for Punjabi

SOTA for Language Modeling and Classifier

GitHub: https://github.com/goru001/nlp-for-punjabi
Clean, pre-processed Punjabi Wikipedia Data for training
Pre-trained Language Model for Punjabi

New Dataset for Punjabi Text Classification Challenges:

Clean, pre-processed BBC Punjabi News dataset
Pre-trained classifier for Punjabi, trained on the above dataset

Please open a GitHub issue!

disisbig · January 31, 2019, 4:27pm

I’ve also trained a language model and classifier for Hindi, achieving a perplexity of ~35 on a 20% validation split of 55k Hindi Wikipedia articles. I’m using fastai v1 and SentencePiece for tokenization. I would like to compare our models on the BBC News classification dataset. Would you mind sharing your score?
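
For comparing numbers like these: perplexity is the exponential of the mean per-token cross-entropy (in nats), so a perplexity of ~35 corresponds to a validation loss of roughly 3.56. A minimal sketch, with hypothetical helper names:

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity from a mean per-token cross-entropy measured in nats."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(ppl):
    """Inverse mapping: the loss (in nats) that yields a given perplexity."""
    return math.log(ppl)


# Loss that corresponds to the perplexity quoted in the post.
print(round(loss_from_perplexity(35), 2))
```

Perplexities are only directly comparable between models that share the same tokenization and vocabulary, since the per-token loss depends on what counts as a token.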

piotr.czapla · February 1, 2019, 7:28am

@disisbig can you make a thread for your language and put it into the top entry? Re comparison: we are in the process of assembling the language models in one repository to ensure reproducibility (https://github.com/n-waves/ulmfit-multilingual). Do you want to contribute your LM and hyperparameters?

disisbig · February 2, 2019, 11:39am

Thanks @piotr.czapla. I’ve created the threads for Hindi and Punjabi. I’ll soon raise a PR to contribute my models and hyperparameters to ulmfit-multilingual.

Benudek · February 22, 2019, 12:25am

Folks, would anyone know whether one can use a language model (instead of word vectors) for sequence-to-sequence translation? I think Jeremy mentioned that in the previous Deep Learning Part II course, in lesson 11, where he demoed translation with word vectors.

Not sure I got this right or whether it’s possible; pointers welcome.

@martijnd

noisefield · February 22, 2019, 10:50am

Hi! I have trained another model for the Russian language using the Taiga corpus: ULMFiT - Russian

Kaspar · February 22, 2019, 2:04pm

The Transformer and Transformer-XL can be used for that. See the paper “Attention Is All You Need”.

noisefield · February 22, 2019, 2:14pm

It is possible, but you need to define your own decoder on top of the hidden states returned by the language model.

Benudek · February 22, 2019, 2:51pm

Which paper? Could you share a link?