I’ve reduced my vocab size to 30,000 – now each epoch takes about 2 hours (as opposed to 3).
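For anyone curious how the cap works, here’s a minimal sketch of the counting approach from the course notebooks (names like `tok_trn` and the `min_freq` value are assumptions on my part, not exactly what I ran):

```python
from collections import Counter, defaultdict

max_vocab, min_freq = 30000, 2  # min_freq here is an assumed value

# tok_trn is the tokenized training corpus (a list of token lists),
# built earlier in the pipeline (not shown here).
freq = Counter(tok for doc in tok_trn for tok in doc)

# Keep only the 30,000 most frequent tokens above the frequency floor.
itos = [tok for tok, c in freq.most_common(max_vocab) if c > min_freq]
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')

# Any token outside the vocab maps to index 0 (_unk_).
stoi = defaultdict(lambda: 0, {tok: i for i, tok in enumerate(itos)})
```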
I’m following 1cycle, with most parameters similar to the ones @sgugger used, except for dropout (all dropouts scaled by a constant factor of 0.05, just to compensate for the higher learning rate). With 2 epochs, I got a validation loss a little closer to the French and Spanish language models shown in this thread (I don’t think it’s relevant, but coincidentally PT, FR and ES are all Romance languages); there’s a sketch of the training setup right after the table:
| epoch | trn_loss | val_loss | accuracy |
|-------|----------|----------|----------|
| 0     | 3.812938 | 3.917015 | 0.273221 |
| 1     | 3.622323 | 3.618101 | 0.306443 |
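Roughly what the setup looks like, assuming the fastai 0.7 API from the course notebooks – `md` is the `LanguageModelData` object built earlier, and the `(div, pct)` values in `use_clr_beta` are placeholders, not my tuned ones:

```python
from functools import partial
import numpy as np
import torch.optim as optim

# Course defaults for the AWD-LSTM language model (assumed values).
em_sz, nh, nl = 400, 1150, 3
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))

# Standard dropouts from the course notebook, scaled by a constant
# factor of 0.05 to compensate for the higher learning rate.
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * 0.05

# md is the LanguageModelData object built earlier (not shown here).
learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=drops[0], dropout=drops[1], wdrop=drops[2],
                       dropoute=drops[3], dropouth=drops[4])

# One cycle of length 10 with use_clr_beta, as in @sgugger's posts;
# (div, pct, max_mom, min_mom) here are placeholder values.
learner.fit(5.0, 1, cycle_len=10, use_clr_beta=(32, 10, 0.95, 0.85))
```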
Right now I’m running the 1st of 10 epochs (cycle_len = 10), so I’ll post the results tomorrow.
I ran lr_find2() with 200 iterations and also found a large learning rate (5.0), as shown below:
Curiously, when I ran lr_find2() with 400 iterations, a completely different plot appeared:
I wonder what causes such different plots just from changing the number of iterations (probably something related to how the learning rate is scheduled across them; I don’t remember the details from the paper right now). In the end I chose the higher learning rate (5.0) to train for the remaining epochs.
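For reference, the two runs were essentially the calls below, assuming fastai 0.7’s lr_find2 signature (the start/end learning rates are placeholders, not necessarily what I used):

```python
# The LR sweep is spread across num_it iterations, so with more
# iterations the model trains longer at each learning rate, which
# presumably explains the different-looking curves.
learner.lr_find2(start_lr=1e-4, end_lr=10, num_it=200)
learner.sched.plot()

learner.lr_find2(start_lr=1e-4, end_lr=10, num_it=400)
learner.sched.plot()
```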