I am trying to create a pretrained ‘language’ model with ULMFiT (imdb_scripts/pretrain_lm.py), except my dataset is not natural language and has a very small vocabulary (fewer than 10 distinct tokens). The intention is to uncover structure in much the same way a language model ‘learns’ language structure (spelling, punctuation, grammar, etc.).
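For context, here is roughly how I build the LM data. This is only a sketch, not my exact script: trn_ids/val_ids stand in for my integer-encoded sequences, PATH for my data directory, and vs for the vocabulary size; the calls are the fastai v0.7 ones that imdb_scripts uses.

import numpy as np
from fastai.text import *  # fastai v0.7, as used by imdb_scripts

bs, bptt = 64, 70  # illustrative values; my actual batch size / bptt may differ
# trn_ids / val_ids: lists of numpy arrays of token ids, vocab size vs (< 10 here)
trn_dl = LanguageModelLoader(np.concatenate(trn_ids), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_ids), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)  # pad_idx=1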
Unfortunately the model is underfitting. I have tried removing dropout and weight decay, as suggested by Jeremy (link), but it still underfits considerably:
...
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * 0.0  # all dropouts zeroed out per the underfitting advice
learner, crit = get_learner(drops, n_negs, sampled, md, em_sz, nh, nl, opt_fn, tprs)
learner.metrics = [accuracy]
lrs = np.array([lr/6, lr/3, lr, lr])  # discriminative learning rates per layer group
wd = 0.0  # weight decay disabled as well
learner.fit(lrs, 1, wds=wd, use_clr=(32, 10), cycle_len=20)  # one 20-epoch cycle with the CLR schedule
epoch trn_loss val_loss accuracy
0 2.197207 1.38649 0.183675
1 2.197169 1.38592 0.311834
2 2.196648 1.385048 0.305813
3 2.195991 1.384447 0.311121
4 2.195306 1.384221 0.31198
5 2.194579 1.382345 0.312493
6 2.193829 1.381151 0.328261
7 2.193109 1.379514 0.320108
8 2.192414 1.379704 0.322971
9 2.191739 1.378689 0.322319
10 2.191141 1.377697 0.33284
11 2.190573 1.376516 0.33507
12 2.189972 1.375447 0.335725
13 2.189114 1.374787 0.339102
14 2.188068 1.373842 0.33594
15 2.187361 1.373246 0.334264
16 2.186675 1.372658 0.338378
17 2.186194 1.372636 0.340511
18 2.185888 1.372562 0.33815
19 2.185693 1.371837 0.338878
I have also tried increasing the embedding size, the number of layers, and the number of hidden activations per layer, all without much improvement (a sketch of the kind of variants I tried is below).
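The values here are hypothetical and just illustrate the sort of capacity changes I mean; my actual runs used different combinations:

# Default ULMFiT-sized model
em_sz, nh, nl = 400, 1150, 3
# Larger variants: bigger embedding, wider hidden layers, one more layer
# em_sz, nh, nl = 600, 1600, 3
# em_sz, nh, nl = 600, 1600, 4
learner, crit = get_learner(drops, n_negs, sampled, md, em_sz, nh, nl, opt_fn, tprs)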
Any ideas would be greatly appreciated.