Replicating exact IMDb results from Howard and Ruder (2018)

Hi all,

For academic purposes, I’m trying to replicate the exact IMDb results from Howard and Ruder (2018) using the code from these scripts, but I’m getting worse results (accuracy ~93.7% rather than 94.7%). All experiments use only the forward LM. Pinging @jeremy since he’s best equipped to answer this.

Commands:

BS=36
EPOCHS=50
python -u finetune_lm.py ~/.fastai/data/imdb_full_scripts/imdb_lm ~/.fastai/data/imdb_full_scripts/imdb_lm --lm-id $LM_ID --cuda-id 0 --cl $EPOCHS --lr 4e-3 --bs $BS
python -u train_clas.py ~/.fastai/data/imdb_full_scripts/imdb_lm --lm-id $LM_ID --cuda-id 0 --cl $EPOCHS --bs $BS
python -u eval_clas.py ~/.fastai/data/imdb_full_scripts/imdb_lm --cuda-id 0 --lm-id $LM_ID --clas-id ${LM_ID}_ --bs $BS

I notice some divergence from the paper in:

finetune_lm.py#91:

drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])*dropmult

finetune_lm.py#96:

learner.clip=0.3

finetune_lm.py#98:

wd=1e-7

finetune_lm.py#100:

lrs = np.array([lr/6,lr/3,lr,lr/2]) if use_discriminative else lr

train_clas.py#40:

opt_fn = partial(optim.Adam, betas=(0.8, 0.99))

train_clas.py#40:

learn.clip=25.

train_clas.py#89:

lrs = np.array([lr/(lrm**4), lr/(lrm**3), lr/(lrm**2), lr/lrm, lr])

train_clas.py#92:

wd = 1e-6

tok2id.py#5:

def tok2id(dir_path, max_vocab=30000, min_freq=1):

The paper mentions neither clipping, weight decay, nor vocab size, so I don’t know if these values are correct. Using clip=25 with the default learning rate when training the classifier leads to divergence (50% accuracy after a couple of epochs). Changing it to 0.25, I get ~93.7% accuracy.

Shouldn’t lrs be the same for both finetune_lm.py#100 and train_clas.py#89? The paper suggests they are the same.
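Spelling the two schedules out side by side (a sketch only; lr=4e-3 comes from my command above, and lrm=2.6 is an assumption matching the paper’s layer-wise factor from Section 3.2, not necessarily the script’s default):

import numpy as np

lr = 4e-3   # base LR from the finetune_lm.py command above
lrm = 2.6   # per-layer factor from the paper (assumed; train_clas.py's default may differ)

# finetune_lm.py#100: hand-picked per-layer-group rates
lrs_lm = np.array([lr / 6, lr / 3, lr, lr / 2])

# train_clas.py#89: geometric decay by lrm per layer group
lrs_clas = np.array([lr / lrm**4, lr / lrm**3, lr / lrm**2, lr / lrm, lr])

print(lrs_lm)    # ~[6.7e-04 1.3e-03 4.0e-03 2.0e-03]
print(lrs_clas)  # ~[8.7e-05 2.3e-04 5.9e-04 1.5e-03 4.0e-03]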

Finally, is the 94.7% accuracy model from the paper fine-tuning the LM on the 25k labeled examples only, or on all 75k labeled+unlabeled examples? The analysis section (5) of the paper mentions using these additional samples, but section (4) doesn’t.

I had a look at the notebook which achieves 94.7%, but the values and processing steps seem to diverge from the paper even more than the scripts do.

Thanks!

It’s been a long time since I looked at that! I think @sebastianruder and @piotr.czapla may have more up to date info than me.

Thank you for the quick reply Jeremy!

Aside from the codebase, can you comment on the hyperparameters I listed and more importantly on whether unlabeled examples were used in finetuning the LM in section (4) of the paper?

Thanks!

Hi @alexandres. I guess the paper’s intention was to show that unsupervised pretraining works for NLP, as at that time the general consensus was that it doesn’t; in such circumstances the exact parameters might not have been recorded, since they were secondary to the concept. The concept works, and you can achieve similar results using fastai v1.0 (I got 94.6% last time I tried), although fastai 1.0 differs in many aspects from the original ULMFiT. So it all boils down to the reason you are trying to replicate the algorithm. If you are after a historical record then playing with 0.7 makes sense, although I haven’t tried to replicate the results myself.

If you are after an algorithm that is similar to the original ULMFiT and works, you can simply use fastai 1.0 and the hyperparameters from our repo ulmfit-multilingual. The performance depends on the weight initialisation, and I’ve only recently fixed the random seeds, so I can’t give you the exact command to reproduce ULMFiT, but I can assist you if you want to run a few experiments to find out which random seeds give the best results on IMDb.
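Roughly, fixing the seeds means pinning down every RNG before training, something like this (a sketch, not the exact code in ulmfit-multilingual):

import random
import numpy as np
import torch

def set_seed(seed=42):
    # pin every RNG that affects weight init and data shuffling
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # trade some speed for reproducible cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False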

Hi @piotr.czapla!

Thank you for pointing me to ulmfit-multilingual. I copied the same hyperparameters over to imdb_scripts and was able to achieve the 94.7% result. I suspect the biggest culprit was the cyclical momentum, which was missing from the scripts. A great advantage of using your values is that training requires only 5 epochs.
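For reference, here is the idea of the cyclical momentum, sketched with plain PyTorch’s OneCycleLR rather than the actual ulmfit-multilingual code (the toy model, data, and step counts are placeholders; only the Adam betas match train_clas.py):

import torch
from torch import nn, optim
from torch.optim.lr_scheduler import OneCycleLR

model = nn.Linear(10, 2)                       # stand-in for the AWD-LSTM classifier
opt = optim.Adam(model.parameters(), lr=1e-3, betas=(0.8, 0.99))

epochs, steps_per_epoch = 5, 100               # assumed sizes, for illustration only
sched = OneCycleLR(opt, max_lr=1e-3, epochs=epochs, steps_per_epoch=steps_per_epoch,
                   cycle_momentum=True,        # beta1 falls while the LR rises, then rises back
                   base_momentum=0.85, max_momentum=0.95)

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
for step in range(epochs * steps_per_epoch):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
    sched.step()                               # updates both the LR and the momentum every batch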

So it all boils down to the reason you are trying to replicate the algorithm. If you are after a historical record then playing with 0.7 makes sense, although I haven’t tried to replicate the results myself.

I’m experimenting with a modification to ULMFiT which I hope to be able to publish. Referring to the original paper for hyperparameters would save paper space and (I think) prevent some raised eyebrows, i.e. “why have the authors used different hyperparameters? Is it to favor their method?”

Unfortunately, I don’t think there is an exact reference implementation, so this might be impossible. Yours seems to converge quickest, but it is different from the notebooks shared by @jeremy, which in turn differ slightly from each other.

The concept works, and you can achieve similar results using fastai v1.0 (I got 94.6% last time I tried), although fastai 1.0 differs in many aspects from the original ULMFiT.

I actually started with fastai v1.0, but I was getting CUDA errors for small batches. I then tried the 0.7 notebooks and scripts and didn’t have this issue, so I stuck with that.

If there’s interest, I’d be happy to push the changes back to imdb_scripts.


I’m happy that it worked! Re. a PR for imdb_scripts: it might make sense for other researchers to have something close to the original available, but that is up to Jeremy. If you have better results using some new kind of RNN, I’m sure everyone would be super interested to have it ported to fastai.
