Hi everyone, I wanted to share some work I did this past week. In the new NLP course, Jeremy mentions the possibility of ensembling the new SentencePiece tokenizer with Spacy, using forward and backward models, to try to push the state of the art. I decided to see if it could be done! My article is here, but I will sum it up, as I think I may need some help!
My notebook is here. It doesn't have much reasoning or walk-through yet; I will add more when I have time (or I may wait until I redo this).
I followed the Turkish notebook for generating my databunch, but for those wanting to learn how to use SentencePiece, you tag on a processor like so:
data_lm_spp_fwd = (TextList.from_folder(path, processor=[OpenFileProcessor(), SPProcessor()])
                   .split_by_rand_pct(0.1, seed=42)
                   .label_for_lm()
                   .databunch(bs=128, num_workers=4, backwards=False))
And then once it's done training, you pass in `SPProcessor.load(path)` in place of `SPProcessor()` when creating the classification databunches.
In terms of training, I followed the IMDB more notebook: I fit the language model for 1 epoch, then unfroze and trained for ten more at a learning rate 10x smaller, with mixed precision. The language model was also trained on everything available to it.
Afterwards, I trained the classifier for 5 epochs, gradually unfreezing. One thing I will note: I had issues getting `to_fp16()` to work on the classifier, and I'm unsure how to fix it right now. Here are those training loops:
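def train_lm(models:list):
    # Fine-tune each language model; `models` is expected in [forward, backward] order
    names = ['fwd', 'bwd']
    for name, model in zip(names, models):
        lr = 1e-2
        lr *= 64/48
        # One epoch before unfreezing, then ten epochs unfrozen at lr/10
        model.fit_one_cycle(1, lr, moms=(0.8,0.7))
        model.unfreeze()
        model.fit_one_cycle(10, lr/10, moms=(0.8,0.7))
        # Save under a per-direction name so the second model does not overwrite the first
        model.save(f'spp_{name}_fine_tuned_10')
        model.save_encoder(f'spp_{name}_fine_tuned_enc_10')
    return models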
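# `learns` (the classifier learners) and `lr` are assumed to be defined beforehand (not shown here)
res = []
targs = []
for learn in learns:
    # Gradual unfreezing: one epoch as created, then the last two groups, the last three, then everything
    learn.fit_one_cycle(1, lr, moms=(0.8,0.7))
    learn.freeze_to(-2)
    learn.fit_one_cycle(1, slice(lr/(2.6**4), lr), moms=(0.8,0.7))
    learn.freeze_to(-3)
    learn.fit_one_cycle(1, slice(lr/2/(2.6**4), lr/2), moms=(0.8,0.7))
    learn.unfreeze()
    learn.fit_one_cycle(2, slice(lr/10/(2.6**4), lr/10), moms=(0.8,0.7))
    # Keep predictions in dataset order so they line up across models
    preds, targ = learn.get_preds(ordered=True)
    res.append(preds)
    targs.append(targ)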
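For anyone wondering where the `2.6**4` comes from: as far as I understand it, when `fit_one_cycle` is given a `slice(low, high)`, fastai spreads the learning rates geometrically across the model's layer groups. Here is a small plain-Python sketch of that spacing. It assumes five layer groups (which I believe is the default split for the text classifier; `len(learn.layer_groups)` would confirm it), the `lr` value is only illustrative, and `layer_group_lrs` is a hypothetical helper, not a fastai function.

```python
def layer_group_lrs(low, high, n_groups):
    """Geometrically spaced learning rates, `low` for the earliest group up to `high` for the last."""
    if n_groups == 1:
        return [high]
    step = (high / low) ** (1 / (n_groups - 1))
    return [low * step ** i for i in range(n_groups)]

lr = 1e-2  # illustrative only; the post does not state the classifier's lr
lrs = layer_group_lrs(lr / (2.6 ** 4), lr, n_groups=5)

# Ratio between the learning rates of neighbouring layer groups
ratios = [later / earlier for earlier, later in zip(lrs, lrs[1:])]

for i, rate in enumerate(lrs):
    print(f"group {i}: lr = {rate:.6f}")
print("ratios:", [round(r, 3) for r in ratios])
```

With five groups there are four steps between the lowest and highest rate, so dividing by `2.6**4` makes each layer group train at 2.6x the rate of the group before it.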
Now onto the results, the important bit! One thing I will note: I was not able to reach the 95% reported in @Jeremy's paper. I'm unsure where exactly I diverged, so hopefully I can get some answers here.
Results:
Name | Accuracy |
---|---|
Spacy Forward | 94.49% |
Spacy Forward and Backwards | 94.77% |
SentencePiece Forward | 94.55% |
SentencePiece Forward and Backwards | 94.66% |
Spacy and SentencePiece Forward | 94.86% |
Spacy and SentencePiece Backwards | 94.79% |
Spacy Forward and Backward and SentencePiece Forward | 94.89% |
Spacy Forward and Backward and SentencePiece Backwards | 94.88% |
Spacy and SentencePiece Forward and Backwards | 94.94% |
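The post doesn't show how the predictions were combined for the ensemble rows. A common choice, and the assumption in this sketch, is to average each model's class probabilities per review and take the argmax. The toy numbers and the helpers `average_probs` and `accuracy` are made up for illustration; with the training loop above, `res` would hold one set of predictions per model.

```python
def average_probs(model_preds):
    """Average per-class probabilities across models, example by example."""
    n_models = len(model_preds)
    return [
        [sum(class_probs) / n_models for class_probs in zip(*example)]
        for example in zip(*model_preds)
    ]

def accuracy(probs, targets):
    """Fraction of examples whose highest-probability class matches the target."""
    hits = sum(1 for p, t in zip(probs, targets) if p.index(max(p)) == t)
    return hits / len(targets)

# Toy predictions (made up): 2 models, 4 reviews, classes [neg, pos]
fwd = [[0.90, 0.10], [0.40, 0.60], [0.20, 0.80], [0.30, 0.70]]
bwd = [[0.45, 0.55], [0.80, 0.20], [0.40, 0.60], [0.10, 0.90]]
targets = [0, 0, 1, 1]

ensemble = average_probs([fwd, bwd])

print("forward :", accuracy(fwd, targets))
print("backward:", accuracy(bwd, targets))
print("ensemble:", accuracy(ensemble, targets))
```

In the toy data each model gets a different review wrong, so the average fixes both mistakes. That is the effect an ensemble is hoping for: models built on different tokenizers or reading directions tend to make different errors.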
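The important parts to note: SentencePiece alone was slightly worse than Spacy with forward and backward models (94.66% vs 94.77%), though slightly better forward-only (94.55% vs 94.49%). When everything was ensembled, I got a boost of 0.17 percentage points over Spacy forward and backward (94.77% to 94.94%). I believe that at percentages this high, this is significant. Others can chime in otherwise.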
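One rough way to judge a gap this small is to compare it with the sampling noise of the accuracy estimate. The sketch below assumes accuracy was measured on the full 25,000-review IMDB test set (the post doesn't say which split was used) and treats each review as an independent trial. It ignores that both models are scored on the same reviews, so it is a back-of-the-envelope check rather than a proper paired test.

```python
import math

n_reviews = 25_000   # assumption: the full IMDB test set was used for evaluation
baseline = 0.9477    # Spacy Forward and Backwards (from the table)
ensemble = 0.9494    # Spacy and SentencePiece Forward and Backwards (from the table)

gain = ensemble - baseline

# Binomial standard error of a single accuracy estimate at this sample size
std_err = math.sqrt(ensemble * (1 - ensemble) / n_reviews)

print(f"gain: {gain * 100:.2f} points, roughly {gain * n_reviews:.0f} reviews")
print(f"standard error of one accuracy estimate: {std_err * 100:.3f} points")
print(f"gain / standard error: {gain / std_err:.2f}")
```

Under these assumptions the gain works out to a bit more than one standard error, so repeating the runs with a few different seeds would help confirm it.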
Things I would like assistance with:
- What did I do wrong or miss from the paper?
- Could not using mixed precision, along with a smaller batch size, have caused any noticeable change in accuracy?
Otherwise, thanks for reading and thank you Rachel and Jeremy for the amazing NLP course!
Further ideas to test:
- New optimizers
- Label Smoothing