I am doing the fastai course (2019), and I am new to this forum. I am interested in applying fastai to NLP in the Dutch language.
I looked at the pretrained models available in ulmfit-multilingual (pretrained_lm_models.zip), but it does not contain Dutch.
Is there a pretrained Dutch model available?
I would like to contribute for the Bangla language. Can someone give me a head start? Are there any instructions for building the wiki dataset? That would be very helpful. Thanks. I'm also looking forward to using SentencePiece.
The contact person listed for ULMFiT for Bangla seems to have been inactive for over a year. Is there anyone actually working on it?
Also, I found this project in the wild. It has a Bangla Wikipedia corpus; I didn't get the opportunity to check it out, but it might be useful to you.
I'm also trying to find a way to use Wikipedia data dumps. I'll share the dataset if I manage to do something.
I haven't found anyone else working on Bangla. I am currently working on it.
I have actually checked out the project you mentioned. The dataset seems small. So I was thinking of building a larger dataset.
Sure, I will open a thread soon. I'm not sure about the Indian language project. I've followed Jeremy's classes. I didn't use any separate language-specific tokenization.
Hello everyone! I am working as a researcher at the Turku University Hospital. We have quite nice GPU resources here, and I have trained a Finnish ULMFiT model on the Finnish Wikipedia using the n-waves scripts. I reached a perplexity of about 23.6. I'll double-check with my employer whether I can open source the model, the vocabulary, and an example classification done on open data (city of Turku feedback classification with a few specific classes).
What is the best way of sharing the model and stuff if I get the green light? I can put them on GitHub, but is there some other model zoo or something where the different models are more easily accessible?
In case anyone is interested, here is a link to a Finnish model trained on Wikipedia with n-waves; it got a validation perplexity of about 23.8:
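For readers comparing the perplexity figures quoted in this thread: perplexity is the exponential of the mean per-token cross-entropy loss (in nats), so it can be read directly off the validation loss a training run reports. Here is a minimal sketch of that relationship; the function names and the toy probabilities are made up for illustration and are not taken from the n-waves scripts.

```python
import math

def mean_nll(token_probs):
    """Mean negative log-likelihood (nats) of the probabilities
    a model assigned to the true next tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(loss):
    """Perplexity is exp(mean per-token cross-entropy loss)."""
    return math.exp(loss)

# Toy example: hypothetical probabilities given to four target tokens.
probs = [0.1, 0.05, 0.02, 0.04]
loss = mean_nll(probs)
print(f"loss={loss:.3f}  perplexity={perplexity(loss):.1f}")

# Going the other way: a perplexity of 23.6 corresponds to a
# validation loss of ln(23.6), roughly 3.16.
print(f"{math.log(23.6):.2f}")
```

Note that perplexities are only comparable between models that use the same tokenization and vocabulary, since the loss is averaged per token.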
A notebook to make a classifier is also included! Maybe that could be helpful for others too, who would like to use the pretrained models for experiments. At least the config n_hid=1150 thing caused me to lose a few hairs…
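The n_hid issue mentioned above is presumably a mismatch between the hidden size the pretrained weights were trained with (1150) and the default hidden size of the library version used to load them. Here is a minimal sketch of overriding that one value before building the model. The values in `DEFAULT_CONFIG` are assumptions for illustration, and `config_for_pretrained` is a hypothetical helper, not part of fastai or n-waves.

```python
# Hypothetical stand-in for a library's default AWD-LSTM config.
# The values here are assumed for illustration; check your own version.
DEFAULT_CONFIG = {"emb_sz": 400, "n_hid": 1152, "n_layers": 3}

def config_for_pretrained(defaults, **overrides):
    """Return a copy of `defaults` with the pretrained model's sizes applied.

    Rejects keys that are not in the defaults, so a typo such as
    `n_hidden=1150` fails loudly instead of being silently ignored.
    """
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    cfg = dict(defaults)   # copy, so the defaults stay untouched
    cfg.update(overrides)
    return cfg

# Match the hidden size the pretrained weights were trained with.
config = config_for_pretrained(DEFAULT_CONFIG, n_hid=1150)
print(config)
```

In fastai v1 a dict like this is typically passed to the learner through its `config` argument, but the exact keys and defaults depend on the installed version, so compare them with the notebook shipped alongside the model.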
First of all, thanks for making the Dutch language model available! It helps greatly. I was searching for a Dutch dataset that could be used for benchmarking but couldn't find any. For German I came across http://www.spinningbytes.com/resources/. I was wondering if you know of any Dutch datasets?
Hi James! I'm glad to hear the Dutch language model was of use to you.
By benchmarking, do you mean measuring the performance of the language model on downstream tasks, e.g. classification? I've created a dataset for this purpose, the 110k Dutch Book Review Dataset (110kDBRD). You should be able to get around 94% accuracy on the out-of-the-box dataset.
Thanks, Benjamin, for creating the book review dataset and sharing it, great work! Actually, I was looking for a public dataset that is used in an academic paper. Anyway, it doesn't matter much. I used the Dutch language model and tried it on a classification task: with around 600 samples per category I am getting close to 90% accuracy.