Revisiting IMDB: Can we push state of the art? - An attempt

Hi everyone, I wanted to share some work I did this past week. During the new NLP course, Jeremy mentions the possibility of pushing state of the art by ensembling forward and backward models trained with the new SentencePiece tokenizer along with Spacy. I decided to see if it could be done! My article is here, but I will sum it up, as I think I may need some help!

My notebook is here; there isn't much of a walk-through with it yet. I will add more when I have time (or I may wait until I redo this): notebook

I followed the Turkish notebook for generating my databunch, but for those wanting to learn how to use SentencePiece, you tag on a processor like so:

data_lm_spp_fwd = (TextList.from_folder(path, processor=[OpenFileProcessor(), SPProcessor()])
                   .split_by_rand_pct(0.1, seed=42)
                   .label_for_lm()  # a language model predicts the next token, so the labels come from the text itself
                   .databunch(bs=128, num_workers=4, backwards=False))

And then, once it's done training, you pass in SPProcessor.load(path) to replace it when creating the classification databunches.
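To make that step concrete, here is a minimal sketch of what the classifier databunch might look like, following the same pattern as the Turkish notebook. The variable names are my own, and the key part is reusing the language model's vocab and passing SPProcessor.load(path) so the classifier tokenizes exactly like the language model did:

```python
# hypothetical sketch -- reuse the trained SentencePiece model for the classifier data
data_clas_spp_fwd = (TextList.from_folder(path, vocab=data_lm_spp_fwd.vocab,
                                          processor=SPProcessor.load(path))
                     .split_by_folder(valid='test')
                     .label_from_folder(classes=['neg', 'pos'])
                     .databunch(bs=64, num_workers=4))
```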

In terms of training, I followed the IMDB "more" notebook: I fit the language model for one epoch, then unfroze and trained for ten more epochs at a tenth of the learning rate, with mixed precision. The language model was trained on all of the text available to it.

Afterwards, I trained the classifier for 5 epochs, gradually unfreezing. One thing I will note: I had issues getting to_fp16() to work on the classifier, and I'm unsure how to fix it right now. Here are those training loops:

def train_lm(models:list):
    "Fit each language model for one epoch at the base LR, then ten epochs at a tenth of it."
    for model in models:
        lr = 1e-2
        lr *= 64/48  # scale the learning rate for the batch size
        model.fit_one_cycle(1, lr, moms=(0.8,0.7))
        model.fit_one_cycle(10, lr/10, moms=(0.8,0.7))
    return models
res = []
targs = []
for learn in learns:
    learn.fit_one_cycle(1, lr, moms=(0.8,0.7))
    learn.freeze_to(-2)  # gradually unfreeze, as in the IMDB notebook
    learn.fit_one_cycle(1, slice(lr/(2.6**4), lr), moms=(0.8,0.7))
    learn.freeze_to(-3)
    learn.fit_one_cycle(1, slice(lr/2/(2.6**4), lr/2), moms=(0.8,0.7))
    learn.unfreeze()
    learn.fit_one_cycle(2, slice(lr/10/(2.6**4), lr/10), moms=(0.8,0.7))
    preds, targ = learn.get_preds(ordered=True)
    res.append(preds)
    targs.append(targ)
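The ensembles in the results table were formed by averaging each model's predicted class probabilities. Here is a minimal sketch of that averaging, shown with NumPy arrays for clarity (with fastai you would stack the torch tensors returned by get_preds the same way; the helper name is my own):

```python
import numpy as np

def ensemble_accuracy(all_preds, targets):
    "Average class probabilities across models, then score the argmax against the targets."
    avg = np.stack(all_preds).mean(axis=0)
    return float((avg.argmax(axis=1) == targets).mean())

# toy check: two "models" that agree on the ranking but not the confidence
p1 = np.array([[0.9, 0.1], [0.2, 0.8]])
p2 = np.array([[0.7, 0.3], [0.4, 0.6]])
print(ensemble_accuracy([p1, p2], np.array([0, 1])))  # → 1.0
```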

Now on to the results, the important bit! One thing I will note: I was not able to reach the 95% that @Jeremy managed in the paper. I'm unsure where exactly my setup differed, so hopefully I can get some answers here.


Name                                                      Accuracy
Spacy Forward                                             94.49%
Spacy Forward and Backwards                               94.77%
SentencePiece Forward                                     94.55%
SentencePiece Forward and Backwards                       94.66%
Spacy and SentencePiece Forward                           94.86%
Spacy and SentencePiece Backwards                         94.79%
Spacy Forward and Backward and SentencePiece Forward      94.89%
Spacy Forward and Backward and SentencePiece Backwards    94.88%
Spacy and SentencePiece Forward and Backwards             94.94%

The important part to note: overall, SentencePiece alone performed slightly worse, but when ensembled I got a boost of 0.17%. I believe that at accuracies this high, that is significant; others can chime in otherwise.
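As a rough yardstick for whether a 0.17% gain is meaningful on the 25,000-review IMDB test set (this back-of-the-envelope check is my own, not from the article): the standard error of an accuracy measured on n examples is sqrt(p(1-p)/n):

```python
import math

n = 25_000   # size of the IMDB test set
p = 0.9477   # best single-tokenizer result above (Spacy forward + backward)

# binomial standard error of an accuracy estimate on n examples
se = math.sqrt(p * (1 - p) / n)
print(f"standard error ≈ {se:.4%}")  # ≈ 0.14%, so a 0.17% gain is on the order of one standard error
```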

Things I would like assistance with:

  • What did I do wrong or miss from the paper?
  • Were there any noticeable accuracy changes that could have occurred from not using mixed precision and a smaller batch size?

Otherwise, thanks for reading and thank you Rachel and Jeremy for the amazing NLP course! :slight_smile:

Further ideas to test:

  • New optimizers
  • Label Smoothing
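On the label-smoothing idea: one way to picture it is as replacing the one-hot targets with slightly softened ones, putting (1 - eps) on the true class and spreading eps over all classes. A minimal sketch of that transform (the helper name and eps value are my own; fastai v1 also ships a LabelSmoothingCrossEntropy loss you can assign to learn.loss_func, which works on the logits directly rather than transforming targets):

```python
import numpy as np

def smooth_labels(targets, n_classes, eps=0.1):
    "Replace one-hot targets with (1 - eps) on the true class and eps/K spread over all classes."
    one_hot = np.eye(n_classes)[targets]
    return one_hot * (1 - eps) + eps / n_classes

print(smooth_labels(np.array([0, 1]), n_classes=2, eps=0.1))
# [[0.95 0.05]
#  [0.05 0.95]]
```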

I loved your write-up and analysis, thank you for sharing!

I tried using the new RAdam, but I didn't see any improvements; it seems like the biggest advantage is when you are doing 30+ epochs.

What are your thoughts on label smoothing?

Do you have any thoughts on how to convert the SentencePiece pieces back into words?


@Daniel.R.Armstrong thank you for your kind words!

What are your thoughts on label smoothing?

On label smoothing: I did a small test on the sample just to see; I did not see any improvements whatsoever, but this should be done on the entire dataset. For the next few days I am without a laptop, so I can't run anything large yet.

I tried using the new RAdam, but I didn't see any improvements; it seems like the biggest advantage is when you are doing 30+ epochs.

On RAdam: I did not test RAdam for that reason. What I did test was Ralamb; see ImageNette/Woof Leaderboards - guidelines for proving new high scores?

I found that on a small scale, Ralamb did improve it! I went from 78% on the IMDB sample to 80-81%, so I believe there's enough there to try it out with the entire dataset. Also, my learning rate was increased to 3e-1; I have not tested anything smaller yet, though. I will update this post when I have more results.

Do you have any thoughts on how to convert the SentencePiece pieces back into words?

For converting back… I struggled with that for days when I was working on getting SP to work. I have nothing yet; I tried looking at the SentencePiece source code, to no avail I'm afraid :frowning:
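For what it's worth, SentencePiece marks word boundaries with the '▁' character (U+2581, a "lower one eighth block", not a plain underscore), so a manual detokenizer can be sketched like this (the helper name is my own; a loaded sentencepiece SentencePieceProcessor also has a DecodePieces method that does this properly):

```python
def decode_pieces(pieces):
    "Join SentencePiece pieces and turn the '▁' (U+2581) word-boundary marker back into spaces."
    return ''.join(pieces).replace('\u2581', ' ').strip()

print(decode_pieces(['\u2581this', '\u2581movi', 'e', '\u2581rock', 's']))  # → 'this movie rocks'
```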

So the next plan is to run it back later this week with label smoothing on the entire dataset (to put it to rest) and use Ralamb, as I saw small-scale success.


Very cool! I look forward to your findings!

I have been struggling to find a way to convert SentencePiece pieces back to words as well. It seems like it should be easy enough; I don't know why I can't find an example of people doing it. If I can't find out how to do it, I was going to try to create a language model to do it. I will let you know if I figure it out.

Here is something to try: opennmt. I'm unsure if it works yet, as I can't test it, but if you do, let me know how it goes :slight_smile:


It might be as easy as
detokenized = ''.join(pieces).replace('\u2581', ' ')  # note: the marker is '▁' (U+2581), not an underscore
I will let you know if it works for me


If you were to attempt this experiment again, what are the ideas you would try, in addition to the above?

I would like to extend this further.