I am solving this problem using the NLP notebook from Week 4. The dataset is considerably large. I am currently using a p3.2xlarge instance and have been running the notebook for about two hours. I am working on the language modelling portion of the notebook.
My first iteration -
lr = 3e-3
learner.fit(lr, 3, wds=1e-6, cycle_len=1, cycle_mult=2)
[ 0.       4.01247  3.77574]                                   
[ 1.       3.62427  3.37876]                                  
[ 2.       3.54717  3.23648]                                  
[ 3.       3.50206  3.26325]                                  
[ 4.       3.40787  3.13808]                                  
[ 5.       3.37264  3.04037]                                  
[ 6.       3.34219  3.01174]       
My second iteration –
lr = 3e-3
learner.fit(lr, 3, wds=1e-6, cycle_len=1, cycle_mult=2)
[ 0.       3.39474  3.01604]
[ 1.       3.34584  3.07711]
[ 2.       3.30852  2.96716]
[ 3.       3.34519  3.10738]
[ 4.       3.29975  3.01878]
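For context, here is my understanding of what that fit call is doing; the epoch arithmetic just follows from the cycle_len/cycle_mult arguments in the fastai API the notebook uses, so correct me if I have this wrong:

lr = 3e-3

# 3 SGDR cycles: the first cycle is cycle_len=1 epoch, and cycle_mult=2
# doubles the cycle length each time, so the cycles run 1 + 2 + 4 = 7 epochs
# in total (which matches the 7 rows of output in my first iteration).
# Each new cycle restarts the learning rate at lr and anneals it back down.
learner.fit(lr, 3, wds=1e-6, cycle_len=1, cycle_mult=2)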
I believe I am suffering from underfitting. One way to reduce underfitting is to increase the number of hidden layers; another might be to increase the number of activations per layer. I am not sure.
Increasing the size of the embedding layer might improve the contextual information captured for each word. Does that help with underfitting?
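In case it helps frame the question, here is roughly where those capacity knobs sit in my setup. This is only a sketch assuming the lesson 4 IMDB-style LanguageModelData setup (md is the LanguageModelData object built earlier in the notebook), and the sizes shown are the notebook's defaults, not values I have verified:

from functools import partial
import torch.optim as optim

em_sz = 200   # embedding size per token -- the knob I'm asking about above
nh = 500      # number of hidden activations per layer
nl = 3        # number of LSTM layers

opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

# md is the LanguageModelData object built earlier in the notebook;
# the dropout* keyword arguments from the notebook would go here too,
# but I'm leaving those at their existing values for now.
learner = md.get_model(opt_fn, em_sz, nh, nl)

If increasing capacity is the right move, I assume these three numbers are where I would do it.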
I am not touching the dropout settings for now, since I believe the only thing I could do there is drop out more nodes, which would reduce overfitting rather than underfitting.
Any ideas from somebody who has tinkered with these parameters?