Language Model Perplexity on Wikitext-103

I tried to use the pretrain_lm.py code (fastai/courses/dl2/imdb_scripts/pretrain_lm.py) to pre-train the language model on Wikitext-103.

I downloaded the WikiText-103 word-level dataset (181 MB) here. I removed the section headers (e.g., "= = = Modern history = = =") from the dataset and kept only the paragraphs, using each one as a training example. After that, I followed the same preprocessing procedure as for IMDB (https://github.com/fastai/fastai/blob/master/courses/dl2/imdb.ipynb).
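
For reference, the header filtering was roughly the following (a minimal sketch: it assumes the raw wiki.train.tokens file, where headers are the only lines that both start and end with "=", and the file path is mine):

import re

# Sketch: drop header lines such as "= = = Modern history = = =" from the raw
# WikiText-103 token file and keep each remaining non-empty line (one paragraph)
# as a training example.
header_re = re.compile(r'^=.*=$')

def paragraphs(path):
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or header_re.match(line):
                continue  # skip blank lines and "= ... =" headers
            yield line

train_texts = list(paragraphs('wikitext-103/wiki.train.tokens'))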

I ran pretrain_lm.py with cl = 50, lr = 0.001, and the other parameters left at their defaults.
Vocab size = 238,462

cuda_id 0; cl 50; bs 64; backwards False; lr 0.001; sampled True; pretrain_id wt103; 
epoch      trn_loss   val_loss   accuracy   
    0      5.776591   6.44552    0.259953  
    1      5.233588   5.772168   0.301972  
    2      5.016692   5.454351   0.319021  
    3      4.907773   5.25423    0.32643   
    4      4.843116   5.144812   0.331993  
    5      4.804867   5.04327    0.337619  
    6      4.752793   5.010323   0.341568  
    7      4.762354   4.963723   0.343884  
    8      4.720873   4.979016   0.345635  
    9      4.696115   4.966046   0.347833  
    10     4.695888   4.958513   0.349572  
    11     4.685831   4.947535   0.350613  
    12     4.662152   4.927366   0.351334  
    13     4.666957   4.902202   0.352362  
    14     4.656022   4.907493   0.35252   
    15     4.653547   4.867369   0.353868  
    16     4.62071    4.83421    0.353882  
    17     4.63349    4.860735   0.355546  
    18     4.62876    4.86727    0.355491  
    19     4.618118   4.878151   0.356088  
    20     4.618727   4.877727   0.356831  
    21     4.609477   4.861039   0.356766  
    22     4.603371   4.846594   0.358499  
    23     4.595969   4.832681   0.356769  
    24     4.596219   4.815605   0.357935  
    25     4.603611   4.799423   0.358387  
    26     4.598209   4.78185    0.358921  
    27     4.565573   4.762143   0.359536  
    28     4.565203   4.776752   0.359573  
    29     4.558614   4.774239   0.360059  
    30     4.601854   4.811212   0.360294  
    31     4.571514   4.793552   0.360857  
    32     4.571862   4.794957   0.360936  
    33     4.570028   4.79349    0.360786  
    34     4.561466   4.779224   0.361973  
    35     4.534542   4.771871   0.362523  
    36     4.609946   4.794087   0.362116  
    37     4.594039   4.783216   0.362705  
    38     4.548162   4.77138    0.362645  
    39     4.53831    4.762839   0.362599  
    40     4.553306   4.760886   0.363476  
    41     4.560196   4.766238   0.363324  
    42     4.557774   4.762862   0.363454  
    43     4.519061   4.739368   0.3636    
    44     4.533007   4.731561   0.364497  
    45     4.499698   4.721955   0.364292  
    46     4.503361   4.713953   0.365411  
    47     4.522731   4.724211   0.364849  
    48     4.538056   4.722945   0.364634  
    49     4.509816   4.709236   0.365488

I only got a perplexity of math.exp(4.709236) ~ 111.0. I couldn't find the perplexity of the pre-trained LM in the ULMFiT paper, so I can't tell whether my result is reasonable. Does anyone know the perplexity of the pre-trained LM on Wikitext-103? I hope the authors (@sebastianruder, @jeremy) can share the parameters they used (e.g., cl, lr) and how they pre-processed Wikitext-103; it would also be helpful if the perplexity of the pre-trained LM were reported in the paper.
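
For reference, the perplexity above is just the exponential of the validation loss fastai reports (the average cross-entropy in nats per token):

import math

val_loss = 4.709236              # final val_loss from the log above
perplexity = math.exp(val_loss)  # ~ 111.0
print(perplexity)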

I saw a similar question at https://github.com/fastai/fastai/issues/674 but @sgugger said it should be posted here.

Thanks!

1 Like

You’re underfitting. So reduce dropout a lot. Also, you probably want a smaller vocab size (<50,000) to save time.

1 Like

Thanks @jeremy so much for the response. So what is a reasonable perplexity for the pre-trained LM model on Wikitext-103? Could you share some of the parameters you used (e.g., drops, cl, lr)? I used the same vocab size as the pre-trained model at http://files.fast.ai/models/wt103/.

The page you linked for downloading the dataset also lists the published SOTA perplexities on the dataset. Though the page is a bit dated, it should still give you a decent idea of what the pretrained LM perplexity should be. IIRC, Jeremy’s model got somewhere in the 50s. You could always download the pretrained model and evaluate it yourself.

1 Like

Yes, I saw that the best perplexity on Wikitext-103 listed on the dataset page is 40.8, but that result is from 2016 and I think Jeremy’s model would outperform it.

In “Scalable Language Modeling: WikiText-103 on a Single GPU in 12 hours”, Stephen Merity et al. published a result of 33 on the test set in February 2018.

In https://arxiv.org/abs/1803.10049, DeepMind reached 29.2 in March 2018.

2 Likes

@jeremy @nickl

Thanks @nickl for pointing out the papers.
Following @jeremy’s suggestion, I tried reducing dropout a lot. In particular, I lowered the dropout multiplier from 0.5 to smaller values such as 0.1 and 0.2 in

drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])*0.5  # the trailing multiplier (0.5 here) is what I lowered
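
(Concretely, lowering that trailing multiplier just scales all five dropout rates by the same factor, e.g.:)

import numpy as np

base = np.array([0.25, 0.1, 0.2, 0.02, 0.15])  # the five dropout ratios used by the script
print(base * 0.5)  # the values I started from
print(base * 0.1)  # "reduce dropout a lot": every rate shrunk by the same factor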

However, I still got a very high perplexity:

epoch      trn_loss   val_loss   accuracy   
    0      5.845019   6.576387   0.252422  
    1      5.247766   5.840117   0.296937  
    2      5.003932   5.513449   0.31546   
    3      4.869153   5.273612   0.324995  
    4      4.809974   5.180764   0.331419  
    5      4.742868   5.096502   0.336746  
    6      4.704509   5.04951    0.340649  
    7      4.673324   5.011071   0.343345  
    8      4.662834   4.977611   0.344763  
    9      4.633989   4.946137   0.346698  
    10     4.625279   4.929208   0.346979  
    11     4.591506   4.892892   0.348745  
    12     4.583871   4.864686   0.350179  
    13     4.578854   4.886134   0.349845  
    14     4.571323   4.875761   0.351102  
    15     4.579498   4.893563   0.351909  
    16     4.582234   4.899546   0.352307  
    17     4.555787   4.872988   0.352898  
    18     4.540921   4.861825   0.353688  
    19     4.548403   4.866682   0.35427   
    20     4.545599   4.869778   0.355034  
    21     4.520052   4.824961   0.355175  
    22     4.522821   4.824711   0.355264  
    23     4.508635   4.837776   0.355532  
    24     4.510644   4.825253   0.357016  
    25     4.544512   4.855726   0.356479  
    26     4.517315   4.837646   0.35769   
    27     4.514285   4.810008   0.357919  
    28     4.484106   4.801606   0.358101  
    29     4.506718   4.812573   0.359011  
    30     4.486608   4.778563   0.359246  
    31     4.494054   4.798867   0.35862   
    32     4.494679   4.792725   0.359378  
    33     4.472218   4.773794   0.359528  
    34     4.467829   4.754905   0.360565  
    35     4.459146   4.754302   0.359633  
    36     4.517514   4.773191   0.360589  
    37     4.44821    4.73521    0.360629  
    38     4.449836   4.740594   0.360651  
    39     4.440507   4.728994   0.360727  
    40     4.441559   4.720591   0.361021  
    41     4.450603   4.729152   0.36089   
    42     4.444834   4.726207   0.361426  
    43     4.462478   4.723183   0.361536  
    44     4.442287   4.711084   0.361968  
    45     4.453724   4.716874   0.362443  
    46     4.427734   4.701541   0.362813  
    47     4.481296   4.715812   0.362387  
    48     4.442011   4.704367   0.362222  
    49     4.398204   4.696147   0.362585

perplexity = math.exp(4.696147) ~ 109.5.

If I use a smaller vocabulary size of 50K, I can get a perplexity of 43.2:

epoch      trn_loss   val_loss   accuracy   
    0      5.642537   4.991986   0.262467  
    1      5.061428   4.388178   0.303964  
    2      4.885501   4.208835   0.319357  
    3      4.755213   4.115963   0.328801  
    4      4.699133   4.057378   0.334531  
    5      4.6741     4.011937   0.337034  
    6      4.645893   3.980712   0.339181  
    7      4.620791   3.951969   0.34211   
    8      4.611351   3.936841   0.344021  
    9      4.591284   3.925729   0.345349  
    10     4.580472   3.906996   0.346435  
    11     4.569892   3.894666   0.347647  
    12     4.546604   3.872622   0.349308  
    13     4.552168   3.886345   0.348283  
    14     4.545922   3.872386   0.349758  
    15     4.534956   3.858267   0.351671  
    16     4.53277    3.86607    0.352228  
    17     4.525592   3.849118   0.352573  
    18     4.519817   3.848611   0.353222  
    19     4.50755    3.848291   0.354337  
    20     4.512732   3.835393   0.354123  
    21     4.515793   3.837786   0.354668  
    22     4.500342   3.832509   0.354783  
    23     4.492654   3.832298   0.355027  
    24     4.500384   3.833041   0.355207  
    25     4.484715   3.822299   0.355947  
    26     4.497915   3.81679    0.356971  
    27     4.486301   3.818892   0.357061  
    28     4.464191   3.81374    0.356887  
    29     4.475611   3.812614   0.357386  
    30     4.47205    3.807147   0.357697  
    31     4.463978   3.802773   0.357766  
    32     4.461278   3.807008   0.358019  
    33     4.460604   3.789323   0.359008  
    34     4.451277   3.791231   0.359253  
    35     4.468511   3.790296   0.3607    
    36     4.455692   3.788759   0.360117  
    37     4.451187   3.787117   0.35973   
    38     4.447499   3.785229   0.360013  
    39     4.455478   3.782362   0.361127  
    40     4.440027   3.78278    0.360778  
    41     4.432558   3.780896   0.361187  
    42     4.440022   3.775458   0.361726  
    43     4.438861   3.776113   0.36176   
    44     4.419476   3.776213   0.361911  
    45     4.422376   3.775223   0.362121  
    46     4.426181   3.77456    0.361536  
    47     4.415288   3.769983   0.362378  
    48     4.419284   3.765662   0.363066  
    49     4.40825    3.766797   0.363027

perplexity = math.exp(3.766797) ~ 43.2.
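
For reference, I capped the vocabulary roughly the way the imdb notebook does, keeping the most frequent tokens and mapping everything else to _unk_. A sketch (whitespace splitting stands in for the real tokenization, and it reuses train_texts from the preprocessing sketch above):

import collections

max_vocab = 50000  # keep only the 50k most frequent tokens

# count token frequencies over the training paragraphs, then build the
# itos/stoi mappings in the style of the imdb notebook
freq = collections.Counter(tok for text in train_texts for tok in text.split())
itos = [tok for tok, _ in freq.most_common(max_vocab)]
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')  # index 0: out-of-vocabulary tokens map here
stoi = collections.defaultdict(lambda: 0, {tok: i for i, tok in enumerate(itos)})

train_ids = [[stoi[tok] for tok in text.split()] for text in train_texts]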

However, since @jeremy’s model reportedly achieves a perplexity in the 50s even with a large vocabulary of 238K, I am wondering what makes the difference. I tried many dropout values but could not get below 100 with the 238K vocabulary.

Ah no, my model was only 30k vocab size - sorry for the confusion!

1 Like

Oh, I saw that the pre-trained AWD-LSTM model at http://files.fast.ai/models/wt103/ comes with a 238K vocabulary (itos_wt103.pkl), so I thought you had used that vocabulary.
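
For what it’s worth, this is how I checked that vocab size (a quick sketch; it assumes itos_wt103.pkl is a plain pickled list of tokens, as in the dl2 notebooks):

import pickle

# load the index-to-string mapping shipped with the pre-trained model
with open('itos_wt103.pkl', 'rb') as f:
    itos = pickle.load(f)
print(len(itos))  # 238,462 for the copy I downloaded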