I’ve been working on applying ULMFiT to the Bangla language using fast.ai v1.
The code and the pretrained models will be available soon. Two datasets were used for the first phase: a news dataset and a Wikipedia dump.
Summary
Pretrained a sentencepiece model for tokenization
Pretrained a language model (with and without sentencepiece)
Fine-tuned the language model and trained a classifier for each of the two datasets of authors' writings (6 and 16 authors)
LM perplexity = 51 (further improvement is possible); still working on it.
Classification results clearly beat previous results from models using word embeddings. Accuracies are 94% and 99.6% on the two datasets, respectively.
A proper repo will be created and the link will be posted on project completion.
Models and datasets will be made available after obtaining the proper permissions.
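For reference, the perplexity figure above is conventionally the exponential of the mean per-token cross-entropy loss (in nats) on the validation set. A minimal sketch, assuming that convention; the function name and the sample loss value are illustrative and not taken from the actual training run:

```python
import math

def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (natural log) into perplexity."""
    return math.exp(mean_cross_entropy)

# Hypothetical validation loss, chosen so that it corresponds to perplexity 51.
example_loss = math.log(51)
print(round(perplexity_from_loss(example_loss), 2))
```

Because perplexity depends on the vocabulary, values from the word-level and sentencepiece models are not directly comparable.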
Observations:
SentencePiece tokenization for Bangla doesn’t seem too different from word-level tokenization. Most tokens are full words, so the results don’t differ much. Better tokenization might be a good idea; any suggestions on this would be helpful.
Transformer and TransformerXL on the same datasets yield abysmal results. These models literally learn nothing. There might be something I’m missing.
Any insights would be appreciated.
The purpose of using sentence pieces is to make sure there is no OOV (out-of-vocabulary) issue. If the two tokenizations are not different enough, then it’s difficult to observe a difference in the results. Instead of using a 30k vocab size, use 2000 or 1000. You will see the difference for Bangla.
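One way to check this is to measure how many words survive as a single piece under each vocab size. A minimal sketch in plain Python, assuming the tokenizer output follows the SentencePiece convention of marking word starts with "▁"; the function name and the sample pieces are hypothetical, made up for illustration rather than taken from a trained model:

```python
WORD_START = "\u2581"  # "▁", the marker SentencePiece puts at the start of a word

def whole_word_fraction(pieces):
    """Return the fraction of words that were encoded as exactly one piece."""
    pieces_per_word = []
    for piece in pieces:
        if piece.startswith(WORD_START) or not pieces_per_word:
            pieces_per_word.append(1)   # a new word begins here
        else:
            pieces_per_word[-1] += 1    # continuation of the current word
    if not pieces_per_word:
        return 0.0
    single = sum(1 for n in pieces_per_word if n == 1)
    return single / len(pieces_per_word)

# Hypothetical outputs for the same two-word sentence under two vocab sizes.
large_vocab = ["▁আমি", "▁বাংলায়"]               # every word is one piece
small_vocab = ["▁আ", "মি", "▁বাং", "লা", "য়"]   # words split into subwords
print(whole_word_fraction(large_vocab), whole_word_fraction(small_vocab))
```

A fraction close to 1 at 30k means the model is effectively word-level; the fraction should drop noticeably at 2000 or 1000.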
All code and datasets are here. I cannot edit my own post for some reason. I have pre-trained models for the news and Wikipedia corpora at 3 levels of tokenization: word, subword and character. The subword level performed best in the downstream tasks. Details are in the paper.