I tried to use the pretrain_lm.py script (fastai/courses/dl2/imdb_scripts/pretrain_lm.py) to pre-train the language model on WikiText-103.
I downloaded the WikiText-103 word-level dataset (181 MB) here. I removed the headers (e.g., "= = = Modern history = = =") from the dataset and used each paragraph as a training example. After that, I followed the same preprocessing procedure as for IMDB (https://github.com/fastai/fastai/blob/master/courses/dl2/imdb.ipynb).
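For reference, here is a minimal sketch of the header-stripping and paragraph-splitting step I describe above. The file path, function name, and the regex for the "= = ... = =" section markers are my own choices, not something taken from the fastai scripts:

```python
import re

def wiki103_to_paragraphs(path):
    """Strip '= = Section = ='-style header lines from a WikiText-103 split
    and return the remaining non-empty lines (one paragraph per line)."""
    header_re = re.compile(r'^\s*=+ .* =+\s*$')  # matches lines like " = = = Modern history = = = "
    paragraphs = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or header_re.match(line):
                continue  # drop blank lines and section headers
            paragraphs.append(line)
    return paragraphs

# each paragraph then goes through the same tokenization/numericalization
# pipeline as in the imdb notebook
train_paragraphs = wiki103_to_paragraphs('wikitext-103/wiki.train.tokens')
```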
I ran pretrain_lm.py with cl = 50, lr = 0.001, and the other parameters left at their defaults.
Vocab size = 238,462
I only got a perplexity of math.exp(4.709236) ~ 111.0. I couldn't find the perplexity of the pre-trained LM in the ULMFiT paper, so I can't tell whether my result is reasonable. Does anyone know the perplexity of the pre-trained LM on WikiText-103? I hope the authors (@sebastianruder, @jeremy) can share the parameters they used (e.g., cl, lr) and how they pre-processed WikiText-103; the perplexity of the pre-trained LM should also be reported in the paper.
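For anyone following along, the perplexity here is just the exponential of the average per-token cross-entropy loss that pretrain_lm.py reports, so the 4.709236 figure above converts like this:

```python
import math

val_loss = 4.709236         # average cross-entropy in nats/token from training
perplexity = math.exp(val_loss)
print(perplexity)           # ~111.0
```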
Thanks @jeremy so much for the response. So what is a reasonable perplexity for the pre-trained LM on WikiText-103? Could you share some of the parameters you used (e.g., drops, cl, lr)? I used the same vocab size as the pre-trained model at http://files.fast.ai/models/wt103/.
The page you linked for downloading the dataset also lists the published SOTA perplexities on it. Though the page is a bit dated, it should still give you a decent idea of what the pretrained LM's perplexity should be. IIRC, Jeremy's model got somewhere in the 50s. You could always download the pretrained model and evaluate it yourself.
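In case it's useful, a rough sketch of what "download and evaluate" starts with, using the files hosted at http://files.fast.ai/models/wt103/ (fwd_wt103.h5 and itos_wt103.pkl). The state-dict key follows the dl2 imdb notebook; this only inspects the released weights and vocab, and getting an actual validation perplexity would still require rebuilding the AWD-LSTM with matching hyperparameters and running the WikiText-103 validation set through it:

```python
import pickle
import torch

# weights and vocab downloaded from http://files.fast.ai/models/wt103/
wgts = torch.load('fwd_wt103.h5', map_location=lambda storage, loc: storage)
itos = pickle.load(open('itos_wt103.pkl', 'rb'))

print(len(itos))                        # vocabulary size of the released model
print(wgts['0.encoder.weight'].shape)   # embedding matrix: (vocab_size, emb_dim)
```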
Thanks @nickl for pointing out the papers.
Following @jeremy’s suggestions, I tried reducing dropout a lot. In particular, I reduced the dropout factor from 0.5 to smaller values like 0.1 and 0.2 in pretrain_lm.py.
However, since @jeremy’s model achieves a perplexity in the 50s even with a large vocabulary of 238K, I am wondering what makes the difference. I tried many dropout values but could not get below 100 with the 238K vocabulary.
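To be concrete about what I mean by the dropout factor: the dl2 code specifies the five AWD-LSTM dropouts as one base array times a scalar multiplier, roughly like the sketch below. The base values are the ones from the imdb notebook; treat them as an illustration rather than exactly what pretrain_lm.py ships with:

```python
import numpy as np

# base dropout values for the AWD-LSTM, as in the imdb notebook
# (passed as dropouti, dropout, wdrop, dropoute, dropouth)
base_drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])

drops_original = base_drops * 0.5   # the factor I started with
drops_reduced  = base_drops * 0.1   # one of the smaller factors I tried
```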
Oh, I saw the pre-trained AWD-LSTM model at http://files.fast.ai/models/wt103/ with a 238K vocabulary (itos_wt103.pkl), so I assumed you used that vocabulary.