Reproducing ULMFiT for full IMDB dataset

fabio · October 15, 2018, 12:58pm

I’m trying to reproduce the ULMFiT fine-tuning results on the full IMDB dataset. The example provided in http://docs.fast.ai/text.html uses a much smaller dataset. I first tried to follow the imdb notebook, but it’s outdated, so my next attempt was to combine the new API with the hyperparameters from the notebook.

The first part (fine-tuning the language model) does not give the same results as the notebook. Here’s my code:

from fastai import *
from fastai.text import *
from fastai.docs import *

IMDB_PATH = 'data/aclImdb/'

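# Note: train_ds and val_ds are created here but not used anywhere below.
train_ds = TextDataset.from_folder(IMDB_PATH, name='train')
val_ds = TextDataset.from_folder(IMDB_PATH, name='test')
# Language-model data: the 'test' folder serves as the validation set.
data_lm = text_data_from_folder(Path(IMDB_PATH), data_func=lm_data, valid='test', bs=32)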

learn = RNNLearner.language_model(data_lm, pretrained_fnames=['lstm_wt103', 'itos_wt103'],
                                  drop_mult=0.7)

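lrs=1e-3
# Stage 1: train only the last layer group for one epoch.
learn.freeze_to(-1)
learn.fit_one_cycle(1, lrs/2, div_factor=32, pct_start=0.5, wd=1e-7)
learn.save('lm_last_ft')
learn.load('lm_last_ft')

# Stage 2: unfreeze all layer groups and fine-tune for 15 epochs.
learn.unfreeze()
learn.fit_one_cycle(15, max_lr=lrs, wd=1e-7, div_factor=20, pct_start=0.1)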

learn.save('lm1')
learn.save_encoder('lm1_enc')

For instance, the first learn.fit_one_cycle gives an accuracy of 0.23, versus 0.28 in the notebook.
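
To line these settings up against the notebook's, it may help to make explicit the schedule each call implies. The sketch below assumes that fit_one_cycle starts at max_lr / div_factor and reaches max_lr after a pct_start fraction of the run. That is an assumption about the one-cycle policy, not something checked against the library source, and one_cycle_summary is a hypothetical helper, not part of fastai.

```python
def one_cycle_summary(max_lr, div_factor, pct_start, epochs):
    """Hypothetical helper: summarise the schedule implied by one fit_one_cycle call.

    Assumption: the learning rate starts at max_lr / div_factor and peaks at
    max_lr after a fraction pct_start of the training run.
    """
    return {
        "start_lr": max_lr / div_factor,
        "peak_lr": max_lr,
        "warmup_epochs": pct_start * epochs,
    }


lrs = 1e-3

# The two language-model stages from the code above.
stages = {
    "lm_last_layer": one_cycle_summary(lrs / 2, div_factor=32, pct_start=0.5, epochs=1),
    "lm_unfrozen": one_cycle_summary(lrs, div_factor=20, pct_start=0.1, epochs=15),
}

for name, s in stages.items():
    print(
        f"{name}: start_lr={s['start_lr']:.3g} "
        f"peak_lr={s['peak_lr']:.3g} "
        f"warmup_epochs={s['warmup_epochs']:.2f}"
    )
```

If the notebook's schedule differs in any of these three numbers, the accuracy after the first epoch may not be directly comparable.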

After unfreezing and training for 15 epochs, I get a language-model accuracy of ~0.302 vs. ~0.312 in the notebook. The gap is much larger for classification, where I get ~0.62 vs. ~0.92. The classification code is the following:

data_lm = text_data_from_folder(Path(IMDB_PATH), data_func=lm_data, train='train', valid='test', bs=32)
data_clas = text_data_from_folder(Path(IMDB_PATH), data_func=classifier_data, train='train_sup', valid='test',
                                   vocab=data_lm.train_ds.vocab, bs=32)

learn = RNNLearner.classifier(data_clas, drop_mult=0.5)
learn.load_encoder('lm1_enc')
learn.freeze_to(-1)
#lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2])
learn.fit_one_cycle(4, 2e-3, div_factor=8, pct_start=0.33)

Note: train_sup is the original IMDB train directory with the unsupervised (unsup) folder removed.
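
For anyone reproducing this setup, here is a minimal sketch of how such a train_sup folder could be built and sanity-checked. It assumes the standard aclImdb layout (train/pos, train/neg, train/unsup). Both make_train_sup and count_reviews are hypothetical helpers written for illustration, not fastai functions.

```python
import shutil
from pathlib import Path


def make_train_sup(imdb_path, src_name="train", dst_name="train_sup", skip="unsup"):
    """Hypothetical helper: copy <imdb_path>/train to <imdb_path>/train_sup,
    leaving out the unlabeled 'unsup' folder. Does nothing if the copy exists."""
    src = Path(imdb_path) / src_name
    dst = Path(imdb_path) / dst_name
    if not dst.exists():
        # ignore_patterns skips any entry whose name matches `skip`.
        shutil.copytree(src, dst, ignore=shutil.ignore_patterns(skip))
    return dst


def count_reviews(folder):
    """Count the .txt files in each immediate subfolder (one subfolder per label)."""
    return {
        d.name: sum(1 for _ in d.glob("*.txt"))
        for d in sorted(Path(folder).iterdir())
        if d.is_dir()
    }
```

Calling count_reviews(make_train_sup('data/aclImdb/')) on the full dataset should report only the pos and neg labels, with 12,500 reviews each. Any other labels or counts would suggest the classifier is being trained on different data than intended.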

What am I doing wrong?