Language Model Zoo 🦍

(Fernando Melo) #330


(Piotr Czapla) #331

I’m sure many of you have heard that multilingual BERT is out, which is a competing solution to ULMFiT. Let’s see how the two compare. I guess BERT, being new and large, will be better. But given how big and slow it is, ULMFiT may still be a better first choice for practical tasks; we need to compare the two to be able to make an informed decision. I think we can do the work in the language-dependent threads and report the findings back here for anyone who is interested but can’t help.

(Thomas Chambon) #332

Yes, great idea. I will try the BERT model in French and compare it with my ULMFiT results, both in performance and in time to train / do inference.

(Piotr Czapla) #333

Awesome! I’ve added your thread (ULMFiT - French) to the wiki above for anyone that would like to join.
BTW, according to the BERT README, Google Colab has free TPU access, so we can use that to fine-tune the classifier.

(Piotr Czapla) #334

Thank you for adding the thread to the wiki above :). For anyone who wants to join the work on Portuguese and play with BERT as well, feel free to join us here: ULMFit - Portuguese

(Piotr Czapla) #335

Can you create the language thread then, so that ppl can join in and participate? Please share what you found regarding the datasets and whether you managed to train the model.

(Ertan Dogrultan) #336

Here is the thread for Turkish: ULMFiT - Turkish

(Ertan Dogrultan) #337

Has there been any implementation/experimentation with Transformer architectures in fastai?

(hari rajeev) #338

Hi @jamsheer, is there a thread for ULMFiT - Malayalam?

(Piotr Czapla) #339

I don’t know of any experiments yet, but I know there are a few ppl interested in giving it a try.


Has there been any progress on Norwegian?

(Laura Ana Maria Bostan) #341

Hi everyone. I’d like to get ULMFiT for Romanian. Is anyone else working on it?

(Virgil Petcu) #342

Hi @sarnthil. We’re just starting work on it at the Timisoara study group. I have a lot of GCP credits that will expire soon, so I plan to start training the language model on a Wikipedia dump this weekend.

The plan is to find Romanian-language classification datasets to fine-tune and apply the ULMFiT LM to. Finding good datasets for it might be the hardest task :). Did you find some already?

(Laura Ana Maria Bostan) #343

Cool. Good luck! Let me know how it goes.
I have no Romanian datasets at hand right now. I could ask the NLP group in Bucharest if they have some classification datasets to test on Romanian.


(Hadrian Lim) #344

Hi everyone!

I started some work on creating language models for Filipino (Tagalog dialect).
Using Tagalog page entries from Wikimedia, the current results for the best-performing language model are:

Perplexity: 26.199
Accuracy: 0.440317

Note that the accuracy is calculated from the validation set.
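
For readers unfamiliar with the two numbers above, here is a minimal sketch of how such metrics are commonly computed. It assumes perplexity is the exponential of the mean per-token cross-entropy in nats, and that accuracy is the fraction of next-token predictions that match the target; the post does not say how its figures were produced. The helper names `perplexity` and `token_accuracy` and the toy token ids are hypothetical.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity as exp of the mean per-token cross-entropy (assumed to be in nats)."""
    return math.exp(mean_cross_entropy)


def token_accuracy(predicted_ids, target_ids) -> float:
    """Fraction of positions where the predicted token id equals the target id."""
    if len(predicted_ids) != len(target_ids):
        raise ValueError("predictions and targets must have the same length")
    if not target_ids:
        raise ValueError("need at least one token")
    correct = sum(p == t for p, t in zip(predicted_ids, target_ids))
    return correct / len(target_ids)


# Under the assumption above, a perplexity of 26.199 corresponds to a
# mean validation loss of ln(26.199) nats per token.
loss = math.log(26.199)
print(f"mean loss: {loss:.3f} nats, perplexity: {perplexity(loss):.3f}")

# Toy example with made-up token ids: 3 of 4 predictions match.
print(token_accuracy([5, 2, 9, 9], [5, 2, 7, 9]))
```

Under that assumption, a perplexity of 26.199 would mean a validation loss of roughly 3.27 nats per token, which makes it easier to compare against loss values printed during training.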

Next steps are to use the sentencepiece tokenizer and to test it with Filipino (Tagalog) classification datasets.

If anyone’s interested, check out the project here.