Hi @monilouise,
A quick post to say that I made a correction in the notebook of my TCU classifier and now get an accuracy of 97.95%, higher than my previous result.
I have already updated the post, the notebook lm3-portuguese-classifier-TCU-jurisprudencia.ipynb (nbviewer), and the link to the tgz file in the models directory of my GitHub.
Indeed, thanks to David Vieira, I noticed that the fine-tuning of the LM and the classifier did not use the SentencePiece model and vocab trained for the General Portuguese Language Model (lm3-portuguese.ipynb).
For example, the code used to create the databunch for fine-tuning the Portuguese forward LM was wrong:
```python
# Wrong: this trains a new SentencePiece model from scratch
# instead of reusing the one saved by lm3-portuguese.ipynb
data_lm = (TextList.from_df(df_trn_val, path, cols=reviews,
                            processor=[OpenFileProcessor(), SPProcessor(max_vocab_sz=15000)])
           .split_by_rand_pct(0.1, seed=42)
           .label_for_lm()
           .databunch(bs=bs, num_workers=1))
```
It has been corrected by using the SPProcessor.load() function:
```python
# Corrected: load the SentencePiece model and vocab saved by lm3-portuguese.ipynb
data_lm = (TextList.from_df(df_trn_val, path, cols=reviews,
                            processor=SPProcessor.load(dest))
           .split_by_rand_pct(0.1, seed=42)
           .label_for_lm()
           .databunch(bs=bs, num_workers=1))
```
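As a side note, the classifier databunch benefits from the same fix. Here is a minimal sketch (the `label` column name is an assumption for illustration; the other names follow the LM code above):

```python
# Sketch: classifier databunch reusing the same SentencePiece processor.
# `label` is a hypothetical variable naming the column with each document's class.
data_clas = (TextList.from_df(df_trn_val, path, cols=reviews,
                              processor=SPProcessor.load(dest))
             .split_by_rand_pct(0.1, seed=42)
             .label_from_df(cols=label)
             .databunch(bs=bs, num_workers=1))
```

Loading the processor this way guarantees the classifier tokenizes with exactly the same model and vocab as the fine-tuned LM.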
Therefore, I retrained the fine-tuned Portuguese LMs (forward and backward) and the classifier on the TCU jurisprudência dataset, and I got better results!
**(fine-tuned) Language Model**
- forward: (accuracy) 51.56% instead of 44.66% | (perplexity) 11.38 instead of 15.97
- backward: (accuracy) 52.15% instead of 44.97% | (perplexity) 12.54 instead of 18.73

**(fine-tuned) Text Classifier**
- Accuracy (ensemble): 97.95% instead of 97.39%
- F1 score (ensemble): 0.9795 instead of 0.9737