I’ve noticed the same thing when training on my own data. I’m not sure why, but it feels like the model gets stuck in a bad region of the loss surface at that specific architecture and learning rate. Or maybe it’s down to how the weights are initialized, not sure.

Also, I’ve found this method depends a lot on the ‘width’ of the categorical variables. When my variables have very few categories, I get better performance by allowing larger embeddings. Probably not the best idea, but it works.
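To make the embedding-width point concrete, here is a rough sketch. The `min(50, (c + 1) // 2)` rule is the one used in the fastai course notebooks; the `min_width` floor is a hypothetical knob for giving low-cardinality variables wider embeddings, and the column names and cardinalities are made up.

```python
def embedding_sizes(cardinalities, max_width=50, min_width=1):
    """Return {name: (cardinality, embedding width)} for each categorical variable."""
    sizes = {}
    for name, c in cardinalities.items():
        # Course-notebook default: half the cardinality, capped at max_width.
        width = min(max_width, (c + 1) // 2)
        # Hypothetical floor: never go narrower than min_width.
        width = max(width, min_width)
        sizes[name] = (c, width)
    return sizes

# Made-up cardinalities, for illustration only.
cat_sizes = {"day_of_week": 8, "store_type": 5, "store_id": 1116}
default = embedding_sizes(cat_sizes)
wider = embedding_sizes(cat_sizes, min_width=8)
print(default)
print(wider)
```

Only the two low-cardinality columns change between `default` and `wider`; the high-cardinality one is already at the cap.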

Also, try the following before fitting and let me know if it does better.

- First, create your training phases. I define 4 of them here, starting with RMSprop and switching to AdamW, following a 1-cycle schedule with differential learning rates. Remember to change the settings to fit the number of layers you’ve assigned. Feel free to play around with the settings. Also, try to find the point where your model starts to overfit and train only until there; performance always seems to deteriorate once training loss gets much lower than validation loss.

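import numpy as np
from torch import optim
from fastai.sgdr import TrainingPhase, DecayType  # assumed import path for old fastai (0.7)

def phases_1cycle_discriminative(cycle_len, lr, div, wds, pct, max_mom, min_mom):
    lrs = np.array([lr/100, lr/10, lr])  # one learning rate per layer group
    return [
        # Warm-up with RMSprop: LR ramps up linearly, momentum ramps down.
        TrainingPhase(epochs=(cycle_len * (1-pct) / 4), opt_fn=optim.RMSprop, lr=(lrs/div, lrs), lr_decay=DecayType.LINEAR,
                      momentum=(max_mom, min_mom), momentum_decay=DecayType.LINEAR, wd_loss=False),
        # Hold at the peak LR with Adam; wd_loss=False keeps weight decay out of the loss (AdamW-style).
        TrainingPhase(epochs=(cycle_len * (1-pct) / 2), opt_fn=optim.Adam, lr=lrs, lr_decay=DecayType.NO,
                      momentum=min_mom, momentum_decay=DecayType.NO, wds=wds, wd_loss=False),
        # Ramp the LR back down linearly while momentum climbs back up.
        TrainingPhase(epochs=(cycle_len * (1-pct) / 4), opt_fn=optim.Adam, lr=(lrs, lrs/div),
                      wds=wds, wd_loss=False, lr_decay=DecayType.LINEAR, momentum=(min_mom, max_mom),
                      momentum_decay=DecayType.LINEAR),
        # Final stretch: start and end LR are both lrs/div, so the LR stays flat here.
        TrainingPhase(epochs=(cycle_len * pct), opt_fn=optim.Adam, lr=(lrs/div, lrs/div), wds=wds, wd_loss=False,
                      lr_decay=DecayType.COSINE, momentum=max_mom, momentum_decay=DecayType.NO)]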

- After that you can just call the function to create your training phases (change to your own learning rate here):

phases = phases_1cycle_discriminative(cycle_len=8, lr=5e-3, div=10, wds=1.26e-7, pct=0.1, max_mom=0.99, min_mom=0.85)

- After that, all you have to do is fit your model with one more bit of fastai awesomeness, stochastic weight averaging:

m.fit_opt_sched(phases, use_swa=True, swa_start=3)
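For anyone unsure what stochastic weight averaging does: the idea is to keep a running mean of the model weights from some starting epoch onward and use that average at the end. The sketch below shows only the averaging arithmetic on plain lists. It is not fastai's implementation; `swa_average` and the toy weights are hypothetical, and I'm assuming epochs are counted from 1.

```python
def swa_average(weights_per_epoch, swa_start):
    """Running mean of the weight vectors from epoch `swa_start` onward (1-based)."""
    avg, n = None, 0
    for epoch, w in enumerate(weights_per_epoch, start=1):
        if epoch < swa_start:
            continue  # averaging has not started yet
        if avg is None:
            avg = list(w)  # first snapshot seeds the average
        else:
            # Incremental mean: fold the new snapshot into the n already averaged.
            avg = [(a * n + x) / (n + 1) for a, x in zip(avg, w)]
        n += 1
    return avg

# Toy "weights" at the end of five epochs, made up for illustration.
epoch_weights = [[1.0, 0.0], [0.8, 0.2], [0.6, 0.4], [0.4, 0.6], [0.2, 0.8]]
swa_weights = swa_average(epoch_weights, swa_start=3)
print(swa_weights)
```

With `swa_start=3`, only the last three snapshots contribute, so the result is their plain mean.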

- You can visualize your learning rate and momentum schedule with the following command:

m.sched.plot_lr(show_text=False, show_moms=True)
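If you want to sanity-check the numbers behind that plot without fastai, the phase lengths and learning-rate endpoints can be tabulated in plain Python. This only mirrors the arithmetic in phases_1cycle_discriminative; `phase_table` and the phase labels are hypothetical names, not part of fastai.

```python
def phase_table(cycle_len, lr, div, pct):
    """Return (label, epochs, start LRs, end LRs) for each of the four phases."""
    lrs = [lr / 100, lr / 10, lr]   # same three-group split as in the post
    low = [x / div for x in lrs]    # the reduced learning rates, lrs / div
    ramp = cycle_len * (1 - pct)    # epochs spent in the up/hold/down part
    return [
        ("warm-up (RMSprop)", ramp / 4, low, lrs),
        ("hold (Adam)", ramp / 2, lrs, lrs),
        ("ramp-down (Adam)", ramp / 4, lrs, low),
        ("tail (Adam)", cycle_len * pct, low, low),
    ]

table = phase_table(cycle_len=8, lr=5e-3, div=10, pct=0.1)
for name, epochs, start, end in table:
    # Show only the last layer group, which uses the full learning rate.
    print(f"{name:18s} {epochs:4.1f} epochs  lr {start[-1]:.0e} -> {end[-1]:.0e}")
total_epochs = sum(epochs for _, epochs, _, _ in table)
print(f"total: {total_epochs:.1f} epochs")
```

With the settings used above, the four phases take 1.8, 3.6, 1.8 and 0.8 epochs, which adds up to cycle_len.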

That’s all! Let me know if that helps! In any case, it’s fun using fastai’s arsenal.

Regards,

Theodore.

P.S.: I don’t take any credit for the above, it’s all based on @sgugger’s TrainingPhase API.