I am solving this problem using the NLP notebook from Week 4. The dataset is considerably large. I am currently using a p3.2xlarge instance and have been running the notebook for about two hours. I am working on the language modelling portion of the notebook.
My first iteration -
lr = 3e-3
learner.fit(lr, 3, wds=1e-6, cycle_len=1, cycle_mult=2)
[ 0.       4.01247  3.77574]                                   
[ 1.       3.62427  3.37876]                                  
[ 2.       3.54717  3.23648]                                  
[ 3.       3.50206  3.26325]                                  
[ 4.       3.40787  3.13808]                                  
[ 5.       3.37264  3.04037]                                  
[ 6.       3.34219  3.01174]       
My second iteration –
lr = 3e-3
learner.fit(lr, 3, wds=1e-6, cycle_len=1, cycle_mult=2)
[ 0.       3.39474  3.01604]
[ 1.       3.34584  3.07711]
[ 2.       3.30852  2.96716]
[ 3.       3.34519  3.10738]
[ 4.       3.29975  3.01878]
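For context, here is my understanding of what that fit call is doing; the epoch arithmetic just follows from the cycle_len/cycle_mult arguments in the fastai API the notebook uses, so correct me if I have this wrong:

lr = 3e-3

# 3 SGDR cycles: the first cycle is cycle_len=1 epoch, and cycle_mult=2
# doubles the cycle length each time, so the cycles run 1 + 2 + 4 = 7 epochs
# in total (which matches the 7 rows of output in my first iteration).
# Each new cycle restarts the learning rate at lr and anneals it back down.
learner.fit(lr, 3, wds=1e-6, cycle_len=1, cycle_mult=2)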
I believe I am suffering from underfitting. One way to reduce underfitting is to increase the number of hidden layers; another might be to increase the number of activations per layer. I am not sure.
Increasing the size of the embedding layer might improve the contextual information captured for each word. Does that help with underfitting?
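In case it helps frame the question, here is roughly where those capacity knobs sit in my setup. This is only a sketch assuming the lesson 4 IMDB-style LanguageModelData setup (md is the LanguageModelData object built earlier in the notebook), and the sizes shown are the notebook's defaults, not values I have verified:

from functools import partial
import torch.optim as optim

em_sz = 200   # embedding size per token -- the knob I'm asking about above
nh = 500      # number of hidden activations per layer
nl = 3        # number of LSTM layers

opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

# md is the LanguageModelData object built earlier in the notebook;
# the dropout* keyword arguments from the notebook would go here too,
# but I'm leaving those at their existing values for now.
learner = md.get_model(opt_fn, em_sz, nh, nl)

If increasing capacity is the right move, I assume these three numbers are where I would do it.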
I am not touching the dropout settings for now, since I believe the only thing I could do there is drop out more nodes, which would reduce overfitting rather than underfitting.
Any ideas from somebody who has tinkered with these parameters?