Problem: underfitting ULMFiT with non-natural-language data


(Steve) #1

I am trying to create a pretrained ‘language’ model with ULMFiT (imdb_scripts/pretrain_lm.py), except my dataset is not natural language and also has a small vocabulary (< 10 tokens). The intention is to discover/uncover structure, much in the same way a language model ‘learns’ language structure (spelling, punctuation, grammar, etc.).

Unfortunately, the model is underfitting. I have tried removing dropout and weight decay as suggested by Jeremy (link), but I still see considerable underfitting:

...
# dropout probabilities multiplied by zero, i.e. dropout effectively disabled
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * 0.0
learner, crit = get_learner(drops, n_negs, sampled, md, em_sz, nh, nl, opt_fn, tprs)
learner.metrics = [accuracy]
# discriminative learning rates across layer groups
lrs = np.array([lr/6, lr/3, lr, lr])
wd = 0.0  # weight decay disabled
learner.fit(lrs, 1, wds=wd, use_clr=(32,10), cycle_len=20)

epoch      trn_loss   val_loss   accuracy
    0      2.197207   1.38649    0.183675
    1      2.197169   1.38592    0.311834
    2      2.196648   1.385048   0.305813
    3      2.195991   1.384447   0.311121
    4      2.195306   1.384221   0.31198
    5      2.194579   1.382345   0.312493
    6      2.193829   1.381151   0.328261
    7      2.193109   1.379514   0.320108
    8      2.192414   1.379704   0.322971
    9      2.191739   1.378689   0.322319
    10     2.191141   1.377697   0.33284
    11     2.190573   1.376516   0.33507
    12     2.189972   1.375447   0.335725
    13     2.189114   1.374787   0.339102
    14     2.188068   1.373842   0.33594
    15     2.187361   1.373246   0.334264
    16     2.186675   1.372658   0.338378
    17     2.186194   1.372636   0.340511
    18     2.185888   1.372562   0.33815
    19     2.185693   1.371837   0.338878

I have tried increasing the embedding size, the number of layers, and the number of hidden activations per layer, all without much improvement.
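For reference, a sketch of what varying those capacity knobs might look like. The baseline values below are the AWD-LSTM defaults used in imdb_scripts/pretrain_lm.py; the other two variants are purely illustrative, not values from this post:

```python
# Capacity settings passed to get_learner as em_sz / nh / nl.
# "base" uses the defaults from imdb_scripts/pretrain_lm.py;
# "bigger" and "smaller" are hypothetical variants to try.
base    = dict(em_sz=400, nh=1150, nl=3)   # default AWD-LSTM sizes
bigger  = dict(em_sz=600, nh=1500, nl=4)   # more capacity
smaller = dict(em_sz=50,  nh=300,  nl=2)   # with < 10 distinct tokens,
                                           # a tiny embedding may suffice
```

Given the tiny vocabulary, it may be worth trying the smaller configuration as well: a 400-dimensional embedding over fewer than 10 tokens adds parameters without adding much signal.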

Any ideas would be greatly appreciated.


(Kristian Rother) #2

What learning rate are you using? I would probably start with a fixed learning rate rather than discriminative learning rates (np.array([lr/6, lr/3, lr, lr])). My first step is usually to run the learning rate finder and save the graph it outputs; then I note down some candidate learning rates that look promising and try fitting with them to get a general feel.
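The suggestion above can be sketched as follows (fastai 0.7-era API; the base rate of 1e-3 is a made-up placeholder you would read off the LR-finder plot, not a recommendation):

```python
import numpy as np

# After running learner.lr_find() and inspecting learner.sched.plot(),
# pick a candidate base rate from the plot (1e-3 here is a placeholder):
lr = 1e-3

# Fixed rate for every layer group, as suggested:
fixed_lrs = np.array([lr] * 4)

# The discriminative schedule from the original post, for comparison:
disc_lrs = np.array([lr/6, lr/3, lr, lr])
```

Fitting with `fixed_lrs` first removes one variable; once the model trains at all, the discriminative schedule can be reintroduced.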


(Steve) #3

Thanks Kristian. I will experiment with learning rates and report back.