I’ve been working on applying ULMFiT to the Bangla language using fast.ai v1.
The code and the pretrained models will be available soon. Two datasets were used for the first phase: a news dataset and a Wikipedia dump.
- Pretrained a SentencePiece model for tokenization
- Pretrained a language model (with and without SentencePiece)
- Fine-tuned the language model and trained a classifier on two datasets of authors' writings (6 and 16 authors)
- LM perplexity = 51 (further improvement seems possible); still working on it.
- Classification results clearly beat previous results from models using word embeddings. Accuracy is 94% and 99.6% on the two datasets, respectively.
- A proper repo will be created and the link will be posted on project completion.
Models and datasets will be made available once the proper permissions are obtained.
- SentencePiece tokens for Bangla don't seem too different from words. Most tokens are full words, so the results don't differ much. Better tokenization might be a good idea; any suggestions on this would be helpful.
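
To put a number on the "most tokens are full words" observation, here is a rough sketch (not the code used in this project). It assumes the tokenizer output is a list of SentencePiece-style pieces, where a leading "▁" marks the start of a word, for example from the processor's `encode_as_pieces` method. The helper name `whole_word_fraction` and the sample pieces are hypothetical, made up for illustration.

```python
from typing import List

# SentencePiece prefixes the first piece of each word with this marker.
WORD_START = "\u2581"  # "▁"


def whole_word_fraction(pieces: List[str]) -> float:
    """Return the fraction of words that were kept as a single piece."""
    pieces_per_word: List[int] = []
    for piece in pieces:
        if piece.startswith(WORD_START) or not pieces_per_word:
            # A marked piece (or the very first piece) opens a new word.
            pieces_per_word.append(1)
        else:
            # An unmarked piece continues the current word.
            pieces_per_word[-1] += 1
    if not pieces_per_word:
        return 0.0
    unsplit = sum(1 for n in pieces_per_word if n == 1)
    return unsplit / len(pieces_per_word)


# Hypothetical tokenizer output for a four-word sentence (illustration only):
# three words stay whole, one is split into two pieces.
sample = ["▁আমি", "▁বাংলা", "য়", "▁গান", "▁গাই"]
fraction = whole_word_fraction(sample)
print(f"{fraction:.2f} of the words are single pieces")
```

If this fraction is close to 1.0 on a held-out sample, the subword model is behaving almost like a word-level tokenizer. One thing that may be worth trying, though I have not verified it on this data, is retraining SentencePiece with a smaller `vocab_size` so that more words get split into subword units.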