Language Model Zoo 🦍

Hi everyone. I'd like to get ULMFiT for Romanian. Anyone else working on it?

1 Like

Hi @sarnthil. We're just starting work on it at the Timisoara study group. I have a lot of GCP credits that will expire soon, so I plan to start training the language model on a Wikipedia dump this weekend.

The plan is to find Romanian-language classification datasets to fine-tune & apply the ULMFiT LM to. Finding good datasets for it might be the hardest task :). Did you find some already?

2 Likes

Cool. Good luck! Let me know how it goes.
I have no Romanian datasets at hand right now. I could ask the NLP group in Bucharest if they have some classification datasets to test on Romanian.

cheers

Hi everyone!

I started some work on creating language models for Filipino (Tagalog dialect).
Using Tagalog page entries from Wikimedia, the current results for the best-performing language model are:

Perplexity: 26.199
Accuracy: 0.440317

Note that the accuracy is calculated from the validation set.
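
Perplexity is just the exponential of the average cross-entropy loss, so the 26.199 above corresponds to a validation loss of roughly 3.27 nats per token. A quick sanity check in Python:

```python
import math

# A language model's perplexity is exp(cross-entropy loss), so the
# reported perplexity of 26.199 implies a validation loss of ~3.27.
val_loss = math.log(26.199)
print(f'validation loss ~ {val_loss:.3f}')            # ~3.266
print(f'perplexity      ~ {math.exp(val_loss):.3f}')  # back to 26.199
```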

Next steps are to use the sentencepiece tokenizer and to test it with Filipino (Tagalog) classification datasets.

If anyone's interested, check out the project here.

1 Like

Hi.

I pretrained a language model for Japanese (including sentencepiece tokenization). Thank you @piotr.czapla for your code and kind direction.
Details are here.
I've used the pretrained model for classification of disease-related tweets (the MedWeb dataset) and achieved a micro-averaged F1 of 0.89, which is 0.03 points below SOTA.
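
For anyone reproducing this, the micro-averaged F1 can be computed with scikit-learn's f1_score (a minimal sketch; the label arrays below are made up for illustration, not the real MedWeb data):

```python
from sklearn.metrics import f1_score

# Toy multi-label arrays (rows = tweets, columns = disease labels);
# purely illustrative, not the actual MedWeb annotations.
y_true = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]

print(f1_score(y_true, y_pred, average='micro'))  # micro-averaged F1 (0.75 here)
```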

I'll post updates when our repo is ready.

1 Like

Doesn't seem so; start a new thread and let's get that figured out. We are slowly getting ready with the ULMFiT implementation for fastai v1, so you might want to start there. Please start a language thread if there isn't one already.

@Sarnthil, @Virgil,
Remember to start a language thread and share your findings! I will definitely be interested to see how Romanian is going.

Superb! Make a language thread as well. I've learned the hard way that low perplexity does not necessarily translate to downstream tasks, even on English, so we need to find a good benchmark to see how your model performs. But the results look promising.

Awesome, this is a good result, and it is superb that you found an open dataset for Japanese. Can you start a language thread like this one: ULMFiT for Malay Language Project?

And put your results there; we can start cooperating and try to get a bit above the SOTA :). There are plenty of knobs to turn to get good results, and I can run some training on spare GPUs once we get the scripts implemented in ulmfit-multilingual.

Are there any models for domain-specific use cases like medical or sports?

I would like to implement a model for telecommunications. 🙂 🙂

I think the idea is that the pre-trained model is trained on the whole language, and then fine-tuning to a domain would be done like in the IMDB example.
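
A rough sketch of that flow with the fastai v1 text API, roughly following lesson3-imdb (the path, file name, column name and hyperparameters below are placeholders, not a tested recipe):

```python
from fastai.text import *

# Language-model DataBunch from a CSV of in-domain texts
# ('data/telecom', 'texts.csv' and the 'text' column name are placeholders).
data_lm = TextLMDataBunch.from_csv(Path('data/telecom'), 'texts.csv', text_cols='text')

# Start from the Wikipedia-pretrained AWD-LSTM and fine-tune on the domain corpus.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=True)
learn.fit_one_cycle(1, 1e-2)       # the pretrained model starts frozen: train the head first
learn.unfreeze()
learn.fit_one_cycle(3, 1e-3)       # then fine-tune the whole LM
learn.save_encoder('domain_enc')   # reuse the encoder in a downstream classifier
```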

2 Likes

Hi all, I'm now working on this again after 8 months or so. I had last built a language model for Xhosa with good results; however, the same notebook that worked before now produces a new error, `'numpy.ndarray' object has no attribute 'x'`, very similar to "NameError: name 'T' is not defined" in Deep Learning Part 2 (dl2) Lesson 10 IMDB.

I've updated my fastai repository and reinstalled fastai using conda/pip, but I'm not able to fix it. Has anyone encountered this issue and solved it?

Thanks

That is usually achieved via fine-tuning an LM trained on wikitext. Do you have a large enough corpus (100M+ tokens) to train an LM from scratch?

The old code won't work with the new fastai; it changed a lot. If you want to start from scratch, try https://github.com/n-waves/ulmfit-multilingual

1 Like

What kind of total training times have people gotten when training a full LM on Wikipedia data?

Or what kind of training time could one expect for 450,000 articles? How many days of training on a 1080 Ti or a Tesla P100, for example?

Hey. My dataset is a mixture of French and English and I have a classification problem. Can you give me some advice on using ULMFiT? Should I train a new LM on a mixed French and English wiki corpus? Thanks

Hi everyone. I've applied ULMFiT to Japanese and started a thread. Let me know if you're interested.

3 Likes

Hello. I would like to ask if there is a recommended way to fine-tune an LM on domain data. I've seen two ways:

  1. Unfreeze all and train
  2. Unfreeze gradually with the freeze_to() function

Or maybe I missed something. Any advice would be much appreciated.

The "lesson3-imdb.ipynb" notebook is a good example.

I would first try different learning rates for each layer group (lower learning rates at the input stage and higher at the end).
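
For example, something like this with the fastai v1 API (a sketch only: the data path is a placeholder and the learning rates/schedule are just illustrative):

```python
from fastai.text import *

# Placeholder data; any TextLMDataBunch over the domain corpus will do.
data_lm = TextLMDataBunch.from_csv(Path('data/domain'), 'texts.csv')
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

learn.fit_one_cycle(1, 1e-2)                # pretrained layers start frozen: train the head
learn.freeze_to(-2)                         # gradually unfreeze one more layer group
learn.fit_one_cycle(1, slice(1e-4, 1e-2))   # lower LR near the input, higher near the output
learn.unfreeze()                            # finally the whole model
learn.fit_one_cycle(2, slice(1e-5, 1e-3))
```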

Thanks. I will try both approaches and see if it makes any difference:

  1. Same as in lesson3-imdb.ipynb
  2. Unfreeze more gradually (with freeze_to)

In case someone is interested in the future: for Russian, fine-tuning the language model with the same methodology as in lesson3-imdb.ipynb has achieved the best result in all my experiments so far.

Another couple of questions:

  1. My intuition is that we can achieve a better result if we fine-tune the language model on domain-specific data with more training examples. In your experiments, how big were the domain-specific corpora?
  2. Has anyone tried a max vocab of 100,000 or more for the LM fine-tuning step?

Thanks in advance.

On wikitext-103 the model trains in roughly 18h on a 1080 Ti.

100k is huge; it makes it hard for the model to learn useful relations between words. For Russian you may want to use SentencePiece with 25k tokens; it works really well for Polish (better than SentencePiece with 50k tokens, and way better than 100k tokens).
You may check our paper & presentation; there is an example that shows how different numbers of tokens influence the way a random sentence is being split.
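
If it helps, here is a minimal sketch of training a 25k-piece model with the sentencepiece library (the file names, character coverage and sample sentence are placeholders, not the exact settings we used):

```python
import sentencepiece as spm

# Train a 25k-piece unigram model on a plain-text corpus, one sentence per line.
spm.SentencePieceTrainer.Train(
    '--input=wiki_ru.txt --model_prefix=spm_ru '
    '--vocab_size=25000 --model_type=unigram --character_coverage=0.9998'
)

# Load it and look at how a random sentence gets split into pieces.
sp = spm.SentencePieceProcessor()
sp.Load('spm_ru.model')
print(sp.EncodeAsPieces('Пример предложения для токенизации.'))  # "An example sentence for tokenization."
```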

1 Like

It looks like the English Wikipedia dump will be 25-27 million sentences once I have finished the script to remove "abnormal sentences". From my measurements, one epoch will take 20 hours.
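
In case it is useful to others doing similar cleanup, here is a generic sketch of that kind of sentence filter (the heuristics and thresholds are arbitrary examples, not the actual script):

```python
def looks_abnormal(sentence: str,
                   min_words: int = 4,
                   max_words: int = 120,
                   min_alpha_ratio: float = 0.6) -> bool:
    """Crude heuristics for dropping 'abnormal' sentences: too short,
    too long, or mostly non-letter characters (tables, refs, markup)."""
    words = sentence.split()
    if not (min_words <= len(words) <= max_words):
        return True
    alpha = sum(ch.isalpha() for ch in sentence)
    return alpha / max(len(sentence), 1) < min_alpha_ratio

# Stream the dump line by line (one sentence per line; paths are placeholders).
with open('wiki_sentences.txt', encoding='utf-8') as src, \
     open('wiki_sentences.clean.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        if not looks_abnormal(line.strip()):
            dst.write(line)
```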

2 Likes