ULMFiT - Bangla

I’ve been working on applying ULMFiT to the Bangla language using fast.ai v1.
The code and the pretrained models will be available soon. Two datasets were used for the first phase: a news dataset and a Wikipedia dump.

Summary

  • Pretrained a SentencePiece model for tokenization
  • Pretrained a language model (with and without SentencePiece)
  • Fine-tuned the language model and trained a classifier on two datasets of authors’ writings (6 and 16 authors); a rough sketch of the pipeline follows this list
  • LM perplexity = 51 (further improvement possible); still working on it
  • Classification results clearly beat previous results from models using word embeddings: accuracies are 94% and 99.6% on the two datasets respectively
  • A proper repo will be created and the link will be posted on project completion
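A minimal sketch of this pipeline, assuming fast.ai v1 and the sentencepiece library; the file names, column names, and hyperparameters below are placeholders, not the exact ones used:

```python
import sentencepiece as spm
from fastai.text import *

path = Path('data/bangla')  # placeholder data folder

# 1) Train a SentencePiece model on the raw Bangla corpus (corpus file name is a placeholder).
spm.SentencePieceTrainer.Train(
    '--input=bn_corpus.txt --model_prefix=bn_sp '
    '--vocab_size=30000 --character_coverage=0.9995 --model_type=unigram'
)

# 2) Pretrain a language model on the Wikipedia/news text with fastai v1.
#    (The default fastai tokenizer is used here; plugging in the SentencePiece
#    model would require a custom tokenizer/processor.)
data_lm = TextLMDataBunch.from_csv(path, 'bn_articles.csv', text_cols='text')
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
learn_lm.fit_one_cycle(10, 1e-2)
learn_lm.save_encoder('bn_enc')

# 3) Fine-tune a classifier on an author dataset, reusing the LM encoder and vocab.
data_clas = TextClasDataBunch.from_csv(path, 'authors.csv',
                                       text_cols='text', label_cols='author',
                                       vocab=data_lm.vocab)
learn_clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clf.load_encoder('bn_enc')
learn_clf.fit_one_cycle(4, 1e-2)
```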

Models and datasets will be made available after proper permissions are obtained.

Observations:

  • SentencePiece for Bangla doesn’t seem too different from word tokenization: most tokens are full words, so the results don’t differ much. Better tokenization might be a good idea; any suggestions on this would be helpful. A small sketch for inspecting the pieces follows.
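One way to check this is to encode a sample sentence and look at the pieces. This sketch assumes the SentencePiece model file from the pipeline sketch above (`bn_sp.model`):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('bn_sp.model')  # hypothetical model trained with a 30k vocab

sample = 'আমি বাংলায় গান গাই'
pieces = sp.EncodeAsPieces(sample)
print(pieces)                              # most pieces start with '▁', i.e. whole words
print(len(pieces) / len(sample.split()))   # close to 1.0 means almost no sub-word splitting
```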

@abyaadrafid share your results too.

A very messy notebook for the pretrained model, here.
Another very messy notebook for retraining on newspaper articles, here.

Transformer and TransformerXL on the same datasets yield abysmal results; these models learn essentially nothing. There might be something I’m missing.
Any insights would be appreciated.
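For context, here is a minimal sketch of how the architecture swap looks in fast.ai v1, assuming a DataBunch like the one in the pipeline sketch above. Transformer/TransformerXL typically need `pretrained=False` (no Bangla weights ship with fastai), lower learning rates, and longer schedules than the AWD_LSTM defaults, which may explain part of the gap; the hyperparameters below are illustrative guesses, not a fix:

```python
from fastai.text import *

# Placeholder data, as in the earlier sketch.
data_lm = TextLMDataBunch.from_csv(Path('data/bangla'), 'bn_articles.csv', text_cols='text')

# Swap the architecture to TransformerXL.
learn = language_model_learner(data_lm, TransformerXL, drop_mult=0.1, pretrained=False)
learn.lr_find()                  # pick a learning rate from the plot
learn.fit_one_cycle(10, 1e-4)    # transformers often want smaller LRs and more epochs
```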

What is your vocabulary size when you used word tokenization?
And what is your vocabulary size after using sentence pieces?

60k and 30k, as suggested by Jeremy.

The purpose of using sentence pieces is to make sure there is no OOV (out-of-vocabulary) issue. If the two vocabularies are not different enough, it is hard to observe any difference in results. Instead of a 30k vocab size, try 2000 or 1000; you will see the difference for Bangla.
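A minimal sketch of that comparison, reusing the assumed corpus file from the earlier sketch; the small vocabularies force genuine sub-word segmentation:

```python
import sentencepiece as spm

# Retrain SentencePiece with much smaller vocabularies (corpus file name is a placeholder).
for vocab_size in (1000, 2000):
    spm.SentencePieceTrainer.Train(
        f'--input=bn_corpus.txt --model_prefix=bn_sp_{vocab_size} '
        f'--vocab_size={vocab_size} --character_coverage=0.9995 --model_type=unigram'
    )

sp = spm.SentencePieceProcessor()
sp.Load('bn_sp_1000.model')
print(sp.EncodeAsPieces('আমি বাংলায় গান গাই'))  # now mostly sub-word fragments, not whole words
```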



All code and datasets are here. I cannot edit my own post for some reason. I have pretrained models for the news and Wikipedia corpora at three levels of tokenization: word, subword, and character. The subword level performed best on the downstream tasks; details are in the paper.


Congratulations @tanny411. Looking forward to your paper.
