ULMFiT - Bangla

(Aisha Khatun) #1

I’ve been working on applying ULMFiT to the Bangla language using fast.ai v1.
The code and the pretrained models will be available soon. Two datasets were used for the first phase: a news dataset and a Wikipedia dump.

Summary

  • Pretrained a SentencePiece model for tokenization (a minimal sketch follows this list)
  • Pretrained a language model (with and without SentencePiece)
  • Fine-tuned the language model and trained a classifier on two datasets of authors’ writings (6 and 16 authors); the fast.ai pipeline is also sketched below
  • LM perplexity = 51 (further improvement possible); still working on it.
  • Classification results clearly beat previous results from models using word embeddings. Accuracies are 94% and 99.6% on the two datasets respectively.
  • A proper repo will be created and the link will be posted on project completion.
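For anyone who wants to reproduce the tokenization step, here is a minimal sketch of the SentencePiece pretraining. The corpus file name, vocab size, and other flags are placeholders, not my exact settings:

```python
import sentencepiece as spm

# Train a unigram SentencePiece model on the raw Bangla corpus.
# 'bn_corpus.txt' and all hyperparameters below are placeholders.
spm.SentencePieceTrainer.Train(
    '--input=bn_corpus.txt '
    '--model_prefix=bn_spm '       # writes bn_spm.model and bn_spm.vocab
    '--model_type=unigram '
    '--vocab_size=30000 '
    '--character_coverage=0.9995'  # keep rare Bangla characters
)
```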
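And a rough sketch of the language-model / classifier pipeline in fast.ai v1. File names and training schedules are placeholders; perplexity is just the exponential of the validation loss:

```python
from fastai.text import *
import numpy as np

path = Path('data')  # placeholder paths and CSVs

# Pretrain the language model on the Wikipedia dump.
data_lm = TextLMDataBunch.from_csv(path, 'bn_wiki.csv')
lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
lm.fit_one_cycle(10, 1e-2)
lm.save_encoder('bn_enc')

# Perplexity is the exponential of the validation cross-entropy loss.
print('perplexity:', np.exp(lm.validate()[0]))

# Fine-tune on the author texts, then train the classifier on top.
data_clas = TextClasDataBunch.from_csv(path, 'authors.csv', vocab=data_lm.vocab)
clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clas.load_encoder('bn_enc')
clas.fit_one_cycle(4, 1e-2)
```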

Models and datasets will be made available once the proper permissions are obtained.

Observations:

  • SentencePiece for Bangla doesn’t seem very different from word tokenization: most tokens are full words, so the results don’t differ much. Better tokenization might be a good idea; any suggestions on this would be helpful. A quick way to check how word-like the pieces are is sketched below.
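Here’s a quick, hand-rolled check of that observation, assuming the SentencePiece model trained above (`bn_spm.model` is a placeholder name):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('bn_spm.model')

def whole_word_fraction(text):
    """Fraction of words that SentencePiece keeps as a single piece."""
    pieces = sp.EncodeAsPieces(text)
    # A piece starting with '▁' opens a new word; the word stayed whole
    # if the next piece (or the end of the sentence) also opens a new word.
    whole = words = 0
    for i, p in enumerate(pieces):
        if p.startswith('▁'):
            words += 1
            nxt = pieces[i + 1] if i + 1 < len(pieces) else '▁'
            whole += nxt.startswith('▁')
    return whole / max(words, 1)

print(whole_word_fraction('আমি বাংলায় গান গাই'))  # near 1.0 = mostly full words
```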
1 Like

(Aisha Khatun) #2

@abyaadrafid, share your results too.

0 Likes

(Rafid Abyaad) #3

A very messy notebook for the pretrained model is here.
Another very messy notebook retraining it on newspaper articles is here.

0 Likes

(Rafid Abyaad) #4

Transformer and TransformerXL on the same datasets yield abysmal results; they seem to learn nothing at all. There might be something I’m missing.
Any insights would be appreciated.
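For context, the swap is just the arch argument in fast.ai v1. The DataBunch setup and hyperparameters below are placeholders, not my exact settings:

```python
from fastai.text import *

data_lm = TextLMDataBunch.from_csv(Path('data'), 'bn_wiki.csv')  # placeholder

# Same DataBunch as the AWD_LSTM run; only the architecture changes.
txl = language_model_learner(data_lm, TransformerXL, drop_mult=0.1)
# Transformers typically need smaller learning rates and more careful
# warmup than AWD_LSTM, which may be one thing to check when the loss
# doesn't move.
txl.fit_one_cycle(10, 1e-4, moms=(0.8, 0.7))
```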

0 Likes