For those of you in the SF study group who are working on building language models: I have made some headway on a language model for Telugu, an Indian language, and I’ve hosted the work here: https://github.com/binga/fastai_notes/tree/master/experiments/notebooks/lang_models
I have been able to create a language model and, fortunately, I found a sentiment analysis dataset with ~5,000 records to test-drive the backbone. I’ll give it a shot tomorrow.
In case all of us collectively make good progress on our respective languages, we could create a larger project that hosts many language models in one repo under the fastai banner and donate it to the NLP community. What do you all think? (Apologies if I’m jumping the gun here!)
The goals and todos of the project are documented here: https://docs.google.com/document/d/1KtwqGcWe0JEzJlI43sAkLRnZrPTRBMCbjE-Q355hkvw/edit#