Language Model Zoo 🦍

NandoBr · November 5, 2018, 11:21am

Done.

piotr.czapla · November 6, 2018, 9:47am

I’m sure many of you have heard that multilingual BERT is out, which is a competing solution to ULMFiT (https://twitter.com/seb_ruder/status/1059439373396123649). Let’s see how the two compare. I guess BERT, being new and large, will be better. But given how big and slow it is, ULMFiT may still be a better first choice for practical tasks; we need to compare the two to be able to make an informed decision. I think we can do the work in the language-specific threads and report the findings back here for anyone who is interested but can’t help.

tomsthom · November 6, 2018, 10:08am

Yes, great idea. I will try the BERT model in French and compare it with my ULMFiT results, both in performance and in time to train and run inference.

piotr.czapla · November 6, 2018, 10:20am

Awesome! I’ve added your thread (ULMFiT - French) to the wiki above for anyone that would like to join.
BTW, according to the BERT readme, Google Colab has free TPU access, so we can use that to fine-tune the classifier.

piotr.czapla · November 6, 2018, 10:24am

Thank you for adding the thread to the wiki above :). Anyone who wants to join the work on Portuguese and play with BERT as well, feel free to join us here: ULMFit - Portuguese

piotr.czapla · November 6, 2018, 12:38pm

Can you create the language thread then, so that people can join in and participate? Please share what you found regarding the datasets and whether you managed to train the model.

ertan · November 9, 2018, 11:58pm

Here is the thread for Turkish: ULMFiT - Turkish

ertan · November 10, 2018, 12:14am

Has there been any implementation/experimentation with Transformer architectures in fastai?

harikrishnanrajeev · November 17, 2018, 8:00am

Hi @jamsheer, is there a thread for ULMFit - Malayalam?

piotr.czapla · November 19, 2018, 5:10am

I don’t know of any experiments yet, but I know there are a few people interested in giving it a try.

tomashm · November 22, 2018, 10:13am

Has there been any progress on Norwegian?

sarnthil · November 22, 2018, 5:04pm

Hi everyone. I’d like to get ULMFit for Romanian. Anyone else working on it?

Virgil · November 23, 2018, 10:10am

Hi @sarnthil. We’re just starting work on it at the Timisoara study group. I have a lot of GCP credits that will expire soon, so I plan to start training the language model on a Wikipedia dump this weekend.

The plan is to find Romanian classification datasets on which to fine-tune and apply the ULMFit LM. Finding good datasets might be the hardest task :). Did you find some already?

sarnthil · November 26, 2018, 10:14pm

Cool. Good luck! Let me know how it goes.
I have no Romanian datasets at hand right now. I could ask the NLP group in Bucharest whether they have some classification datasets to test on.

cheers

hadrianpaulo · November 30, 2018, 6:03am

Hi everyone!

I started some work on creating language models for Filipino (Tagalog dialect).
Using Tagalog page entries from Wikimedia, the current results for the best-performing language model are:

Perplexity: 26.199
Accuracy: 0.440317

Note that the accuracy is calculated from the validation set.
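
For anyone wondering how the two numbers relate: perplexity is usually the exponential of the mean per-token cross-entropy loss, and accuracy for a language model is usually the fraction of next-token predictions that match the target. A minimal sketch under those assumptions follows; the function names are made up for illustration and are not taken from the project, and the token ids are toy values.

```python
import math
from typing import Sequence


def perplexity_from_loss(mean_loss_nats: float) -> float:
    """Perplexity as exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(mean_loss_nats)


def loss_from_perplexity(perplexity: float) -> float:
    """Inverse of the above: the mean cross-entropy implied by a perplexity."""
    return math.log(perplexity)


def next_token_accuracy(predicted_ids: Sequence[int], target_ids: Sequence[int]) -> float:
    """Fraction of positions where the predicted token id equals the target."""
    if len(predicted_ids) != len(target_ids):
        raise ValueError("predictions and targets must have the same length")
    if not target_ids:
        return 0.0
    hits = sum(p == t for p, t in zip(predicted_ids, target_ids))
    return hits / len(target_ids)


# The reported perplexity of 26.199 implies a validation loss of roughly 3.27 nats.
implied_loss = loss_from_perplexity(26.199)
print(f"implied validation loss: {implied_loss:.3f}")

# Toy ids, made up for illustration only.
print(next_token_accuracy([5, 2, 9, 9], [5, 3, 9, 1]))
```

If the validation loss was logged in a different base or averaged differently, the conversion would change accordingly.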

Next steps are to use the sentencepiece tokenizer and to test it with Filipino (Tagalog) classification datasets.

If anyone’s interested, check out the project here.

s.tsuruno · December 12, 2018, 4:30am

Hi.

I pretrained a language model for Japanese (including sentencepiece tokenization). Thank you @piotr.czapla for your code and kind direction.
Details are here.
I’ve used the pretrained model for classification of disease-related tweets (MedWeb dataset) and achieved F1micro = 0.89, which is 0.03 points below SOTA.
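
For reference, micro-averaged F1 pools true positives, false positives and false negatives over all labels before computing a single score. The sketch below assumes a multi-label setup with 0/1 indicator rows per tweet; the helper name `micro_f1` and the toy labels are illustrative only and are not the evaluation code used for the result above.

```python
from typing import Sequence


def micro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1 over 0/1 label indicator rows."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1  # label correctly predicted
            elif t == 0 and p == 1:
                fp += 1  # label predicted but not present
            elif t == 1 and p == 0:
                fn += 1  # label present but missed
    denom = 2 * tp + fp + fn
    # With no positive labels and no positive predictions, return 0.0 by convention.
    return 2 * tp / denom if denom else 0.0


# Toy example: three tweets, three hypothetical labels each.
gold = [[1, 0, 0], [0, 1, 1], [0, 0, 0]]
pred = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(f"micro-F1: {micro_f1(gold, pred):.3f}")
```

Because the counts are pooled, frequent labels influence the micro average more than rare ones, which is worth keeping in mind when comparing against a macro-averaged score.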

I’ll post updates when our repo is ready.

piotr.czapla · December 12, 2018, 11:10pm

Doesn’t seem so; start a new thread and let’s get that figured out. We are slowly getting the ULMFiT implementation ready for fastai v1, so you might want to start there. Please start a language thread if there isn’t one already.

@Sarnthil, @Virgil,
Remember to start a language thread and share your findings! I will definitely be interested to see how Romanian is going.

Superb! Make a language thread as well. I’ve learned the hard way that low perplexity does not necessarily translate into good results on downstream tasks, even for English, so we need to find a good benchmark to see how your model performs. But the results look promising.

Awesome, this is a good result, and it is superb that you found an open dataset for Japanese. Can you start a language thread like this one: ULMFiT for Malay Language Project

And put your results there; we can cooperate and try to get a bit above the SOTA :). There are plenty of knobs to turn to get good results, and I can run some training on spare GPUs once we get the scripts implemented in ulmfit-multilingual.

praveen049 · December 16, 2018, 10:46pm

Are there any models for domain-specific use cases like medicine or sports?

I would like to implement a model for telecommunications.

samh · December 17, 2018, 8:58pm

I think the idea is that the pre-trained model is trained on the whole language, and then fine-tuning to a domain would be done like in the IMDB example.

sabzo · January 4, 2019, 10:07pm

Hi all, I’m now working on this again after 8 months or so. I had last built a language model for Xhosa with good results; however, the same notebook that worked then now produces a new error, ‘numpy.ndarray’ object has no attribute ‘x’, which is very similar to the one discussed in “NameError: name ‘T’ is not defined” Deep Learning Part 2(dl2) Lesson 10 IMDB.

I’ve updated my fastai repository and reinstalled fastai using conda/pip, but I’m not able to fix it. Has anyone encountered this issue and solved it?

Thanks