Lesson 10: a twice-as-small, twice-as-fast IMDB classifier with almost as good accuracy! (95.2 vs. 95.4)

Okay, so there have been a lot of complaints about the slowness of this lesson, so I did something about it.

I used Google’s SentencePiece and messed around with it a lot.
I tried vocabulary sizes of 64, 32, 16, and 8 thousand, and it makes almost no difference in accuracy.
There are other optimizations/fine-tunings I did too.
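
For reference, here is roughly what the tokenization setup looks like; the file names and options below are illustrative, not the exact ones I used.

import sentencepiece as spm
# train a subword model on the raw IMDB text; vocab_size is the knob I varied (8k/16k/32k/64k)
spm.SentencePieceTrainer.Train(
    '--input=imdb_all_text.txt --model_prefix=sp_imdb --vocab_size=16000'
)
sp = spm.SentencePieceProcessor()
sp.Load('sp_imdb.model')
ids = sp.EncodeAsIds('This movie was surprisingly good.')        # list of subword ids
pieces = sp.EncodeAsPieces('This movie was surprisingly good.')  # the pieces themselves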

So, all in all, I reduced the language model (encoder) training time to 10 min per epoch,
and the classifier to 3 min per epoch.

But I didn’t beat @jeremy. His was 95.4; my best is 95.2. Sad.
I’ve been trying to get that last 0.2 percent for the last 2 weeks, but I can’t.

I do want to write a blog post about it, but my result is not as good as his, so I don’t know if I should.

And it’s an NLP classification task, so what do I show in the blog? Plots of loss functions?
I just don’t know what there is to write.

Also, shall I try to improve it? I’m pretty exhausted from trying to make it better. Is the last 0.2+% important/hard?

I need help… @rachel


Nice!
Did you do the backward direction as well? If yes, what is the best forward-only accuracy you managed to get using SentencePiece tokenization?


Hey! In my opinion you did great and should definitely write a blog post about it, because you tried out something new. Reducing the time per epoch like that is great work.
As for further reducing the loss, you can try a few more things if new ideas come your way. Meanwhile, you should start writing about it.
Also, have you tried the neural cache pointer? If not, it might reduce the perplexity further, and then you may have a SOTA PPL, or it might even boost your classifier as well (I guess).
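
Roughly, the idea (from Grave et al.’s continuous cache paper) is to mix the LM’s softmax with a distribution over recently seen tokens, weighted by how similar the current hidden state is to the hidden states where those tokens appeared. A hedged sketch, with shapes and names of my own choosing rather than from any particular implementation:

import torch
import torch.nn.functional as F

def cache_pointer_probs(logits, cache_hiddens, cache_targets, h_t, theta=0.3, lam=0.1):
    # logits:        LM logits for the current step, shape (vocab,)
    # cache_hiddens: hidden states from previous steps, shape (n, d)
    # cache_targets: token ids (LongTensor) that followed each cached hidden state, shape (n,)
    # h_t:           current hidden state, shape (d,)
    p_vocab = F.softmax(logits, dim=-1)
    attn = F.softmax(theta * (cache_hiddens @ h_t), dim=-1)    # similarity to each cached state
    p_cache = torch.zeros_like(p_vocab).index_add_(0, cache_targets, attn)
    return (1 - lam) * p_vocab + lam * p_cache                 # linear interpolation of the two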


Cool! What’s the language model perplexity/accuracy you’re getting with the SentencePiece tokenization?


94.9 for the forward-direction single model.


neural cache pointer?

Please send the paper. I’d love to read about it.

One thing I wanted to try but didn’t is doing it without weight tying. But that would take so much time and energy that I don’t even know if I should try…

That’s really high accuracy: only one in 20 guesses is wrong. For your blog post, it would be fun to generate some text from the language model.
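
A rough, model-agnostic sketch of what that sampling could look like; the names and the output indexing are illustrative and depend on how your model’s forward actually returns its logits.

import torch
import torch.nn.functional as F

def sample_from_lm(model, itos, seed_ids, n_tokens=40, temperature=0.8):
    # assumes model(x) returns logits of shape (seq_len, vocab_size) for a (1, seq_len) batch
    model.eval()
    ids = list(seed_ids)
    with torch.no_grad():
        for _ in range(n_tokens):
            x = torch.tensor(ids).unsqueeze(0)
            last_logits = model(x)[-1]                        # logits at the last position
            probs = F.softmax(last_logits / temperature, dim=-1)
            ids.append(torch.multinomial(probs, 1).item())
    # SentencePiece marks word boundaries with '▁'; join pieces and restore spaces
    return ''.join(itos[i] for i in ids).replace('▁', ' ').strip()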

No. As the vocabulary size decreases, the language model becomes worse and worse at generating text, at least that’s my feeling.

Also, this is not what seq2seq is supposed to be good at.

Good point, I should include some of those examples.

And even more examples. Thanks for the inspiration!

backward: yes.

The only thing is, I used the backward option that’s already implemented in the fastai library, and didn’t check whether it actually does the modeling/training backwards…
(I should. I’m a bad and lazy person.)
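
A quick sanity check might look something like this (a hedged sketch; trn_dl_fwd and itos are illustrative names for a forward loader and an id-to-piece mapping):

import numpy as np
# build a forward and a backward loader over the same token stream
trn_dl_fwd = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt, backwards=False)
trn_dl_bwd = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt, backwards=True)
x_fwd, _ = next(iter(trn_dl_fwd))
x_bwd, _ = next(iter(trn_dl_bwd))
# decode a few tokens from the first sequence of each batch and eyeball
# whether the backward text really reads in reverse
print(' '.join(itos[int(i)] for i in x_fwd[:20, 0]))
print(' '.join(itos[int(i)] for i in x_bwd[:20, 0]))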

So I started blogging about this one and got the tokenization part done, at least.

It’s pretty crude, but the notebook is there to be downloaded.
If something doesn’t work, let me know.


I played around with this when @sebastianruder and I were doing the ULMFiT experiments, and I found the same thing: I couldn’t quite beat the word-level models, but I could get very close. Ensembling both should be even better!


YES.

But the whole point is to have the best single model, right?

Also, about training language models backward:

With forward + pre-trained weights, I can get about 30% accuracy within 1 epoch. With backward + pre-trained weights, I get only 16% in 1 epoch.

trn_dl_back = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt, backwards=True)
val_dl_back = LanguageModelLoader(np.concatenate(val_lm), bs, bptt, backwards=True)
md_back = LanguageModelData(SPM_MODEL_PATH, 1, vs, trn_dl_back, val_dl_back, bs=bs, bptt=bptt, backwards=True)
# ... ...
learner_back = md_back.get_model(opt_fn, em_sz, nh, nl,
                                 dropouti=drops[0], dropout=drops[1], wdrop=drops[2],
                                 dropoute=drops[3], dropouth=drops[4])
# load the backward-pretrained wikitext weights (using bwd_wt103.h5)
learner_back.model.load_state_dict(wgts)
lr = 1e-3
lrs = lr
# backwards gives me accuracy of 16%, forward gives me 30%. Big difference!
learner_back.fit(lrs/2, 1, wds=wd, use_clr=(32, 2), cycle_len=1)

Is this normal? Do you see it too?

Are the backward pre-trained weights bad, or am I being silly?

Another thought:

I highly suspect that SentencePiece will be a lot better for translation and text-generation-type tasks.

Is my suspicion correct?

Did you do this for both the LM and the classifier?

And if I understand correctly, you are saying that both the LM and the classifier performed similarly regardless of the vocab size? Or is it just the classifier?

I don’t think so - even the definition of what’s a “single model” is pretty fuzzy. The point is to have something that works well! :slight_smile:

You need a separate pre-trained model for backwards for each of wikitext LM, IMDb LM, and IMDb classifier. Also be sure you have both start-of-stream and end-of-stream tokens (since they impact the classifier). You should find fwd and bwd accuracies are about the same, when everything is working correctly.
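
For example, something along these lines before tokenization (the marker strings and variable names here are illustrative):

BOS, EOS = 'xbos', 'xeos'                          # illustrative start/end-of-stream markers
texts = [f'{BOS} {t} {EOS}' for t in raw_reviews]  # wrap each review so both directions
                                                   # see an explicit document boundary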


I haven’t tried these.

Hey Yang… Sorry, I was a bit busy… This should be useful for you to get hold of the Neural Cache Model…


And here is the blog post that goes with it :wink:


Awesome job explaining the pointer cache. Loved reading your blog post on it. Thanks.