ULMFiT - Bangla

tanny411 · June 23, 2019, 2:47am

I’ve been working on applying ULMFiT to the Bangla language using fast.ai v1.
The code and the pretrained models will be available soon. Two datasets were used for the first phase: a news dataset and a Wikipedia dump.

Summary

Pretrained a sentencepiece model for tokenization
Pretrained a language model (with and without sentencepiece)
Fine-tuned the language model and trained a classifier on two datasets of authors’ writings (6 and 16 authors)
LM perplexity = 51 so far; further improvement looks possible and I’m still working on it.
Classification results beat previous results from models using word embeddings. Accuracy is 94% and 99.6% on the two datasets, respectively.
A proper repo will be created and the link will be posted on project completion.

Models and datasets will be made available after proper permissions are obtained.
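
For anyone comparing against the perplexity figure in the summary: perplexity is the exponential of the mean per-token cross-entropy loss (in nats), so it can be read directly off a validation loss. A minimal sketch, with an illustrative function name that is not taken from the project code:

```python
import math

def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy loss (natural log) to perplexity."""
    return math.exp(mean_cross_entropy)

# A perplexity of 51 corresponds to a mean loss of ln(51) nats per token.
loss = math.log(51)
print(round(loss, 2), round(perplexity_from_loss(loss), 1))
```

This assumes the loss is averaged per token and uses the natural logarithm; a loss in bits would need base 2 instead.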

Observations:

Sentencepiece tokenization for Bangla doesn’t seem too different from word-level tokenization. Most tokens are full words, so the results don’t differ much. Better tokenization might be a good idea; any suggestions on this would be helpful.
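
One way to put a number on the "most tokens are full words" observation is to count how many words come out of the tokenizer as a single piece. The sketch below relies only on the fact that SentencePiece prefixes word-initial pieces with "▁" (U+2581); the function name and the example pieces are made up for illustration and are not output from the actual model.

```python
WORD_START = "\u2581"  # SentencePiece marks the start of each word with this symbol

def whole_word_fraction(pieces):
    """Return the fraction of words that were emitted as a single piece."""
    pieces_per_word = []  # one counter per word
    for piece in pieces:
        if piece.startswith(WORD_START) or not pieces_per_word:
            pieces_per_word.append(1)   # a new word begins here
        else:
            pieces_per_word[-1] += 1    # continuation of the current word
    if not pieces_per_word:
        return 0.0
    whole = sum(1 for n in pieces_per_word if n == 1)
    return whole / len(pieces_per_word)

# Hypothetical pieces for a three-word sentence: two whole words, one split in two.
example = ["\u2581আমি", "\u2581ভাত", "\u2581খা", "ই"]
print(whole_word_fraction(example))
```

A value close to 1.0 over a held-out sample would suggest the subword vocabulary is behaving almost like a word vocabulary.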

tanny411 · June 23, 2019, 2:48am

@abyaadrafid share your results too.

abyaadrafid · August 29, 2019, 6:06am

A very messy notebook for pretrained model, here.
Another very messy notebook retraining on newspaper articles, here.

abyaadrafid · August 29, 2019, 6:09am

Transformer and TransformerXL on the same datasets yield abysmal results. These models literally learn nothing. There might be something I’m missing.
Any insights would be appreciated.

ariyanhasan · October 28, 2019, 10:08pm

What was your vocabulary size when you used word-level tokenization?
And what was your vocabulary size with sentencepiece?

tanny411 · October 29, 2019, 4:49am

60k and 30k respectively, as suggested by Jeremy.

ariyanhasan · October 29, 2019, 5:12am

The purpose of using sentencepiece is to avoid out-of-vocabulary (OOV) issues. If the two vocabulary sizes are not far enough apart, then it’s difficult to observe a difference. Instead of a 30k vocab size, try 2000 or 1000. You will see the difference for Bangla.
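
The OOV problem mentioned above can be measured at the word level by capping the vocabulary and counting held-out tokens that fall outside it. This is only a rough sketch: the function name and the toy tokens are illustrative and do not come from the datasets in this thread.

```python
from collections import Counter

def oov_rate(train_tokens, test_tokens, vocab_size):
    """Fraction of test tokens missing from the top-`vocab_size` training tokens."""
    most_common = Counter(train_tokens).most_common(vocab_size)
    vocab = {tok for tok, _count in most_common}
    if not test_tokens:
        return 0.0
    missing = sum(tok not in vocab for tok in test_tokens)
    return missing / len(test_tokens)

# Toy data standing in for word-tokenized training and held-out text.
train = "a b a c a b d".split()
test = "a b e d".split()
print(oov_rate(train, test, vocab_size=2))
print(oov_rate(train, test, vocab_size=4))
```

Running the same measurement on word tokens at 60k and comparing it with how often a small subword vocabulary has to split words would show whether the two setups really behave differently.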

tanny411 · January 9, 2020, 1:30pm

All code and datasets are here. I cannot edit my own post for some reason. I have pre-trained models for the news and Wikipedia corpora at three levels of tokenization: word, subword, and character. Subword-level tokenization performed best in the downstream tasks. Details are in the paper.

msivanes · January 9, 2020, 4:45pm

Congratulations @tanny411. Looking forward to your paper.