Ohh, thanks, that was exactly my dilemma. I'll decrease the dropout a bit and train for a few more epochs.
I’ve been reading your article on the 1cycle policy. I can understand how the high learning rates in the middle of the cycle help regularize, and the usual low learning rates at the end of the cycle, but I can’t wrap my head around why a small lr at the beginning of the cycle would help. I can’t find an explanation for the warmup.
Any chance you could point me to an article/blog/etc. that explains why using a small lr during warmup helps us later reach super-convergence?
Edit: I think I kinda found the answer in http://teleported.in/posts/cyclic-learning-rate/
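For anyone else trying to picture the shape of the schedule, here's a minimal sketch of a 1cycle-style lr schedule: small lr at the start, peak in the middle, and a very low lr at the end. The function name and all the parameter values (`div`, `final_div`, `pct_warmup`) are my own assumptions for illustration, not the exact ones from the article:

```python
import math

def one_cycle_lr(step, total_steps, lr_max=1e-3, div=25.0,
                 final_div=1e4, pct_warmup=0.3):
    """Sketch of a 1cycle-style schedule (illustrative, not fastai's exact code):
    linear warmup from lr_max/div up to lr_max, then cosine annealing
    down to lr_max/final_div."""
    warmup_steps = int(total_steps * pct_warmup)
    if step < warmup_steps:
        # warmup phase: small lr rising linearly to lr_max
        pct = step / max(1, warmup_steps)
        return lr_max / div + (lr_max - lr_max / div) * pct
    # annealing phase: cosine decay from lr_max down to lr_max/final_div
    pct = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    lr_min = lr_max / final_div
    return lr_min + (lr_max - lr_min) * (1 + math.cos(math.pi * pct)) / 2

# lr starts small, peaks right after warmup, and ends tiny
lrs = [one_cycle_lr(s, 100) for s in range(100)]
```

Plotting `lrs` gives the familiar one-peak curve: the warmup keeps the early updates small while the weights are still random, which (as I understand the article) is what lets you survive the very high lr at the peak.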
Is there an implementation of LARS, or has anyone already run experiments showing that LARS is effective?
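In case it helps the discussion, the core of LARS (You et al., 2017) is just a layer-wise "trust ratio" that rescales the step for each layer by the ratio of the weight norm to the gradient norm. A minimal NumPy sketch of a single update (hyperparameter values are my own illustrative choices):

```python
import numpy as np

def lars_update(w, grad, base_lr=0.1, trust_coef=0.001,
                weight_decay=1e-4, eps=1e-9):
    """Sketch of one LARS step for a single layer's weights.

    The layer-wise local lr is proportional to ||w|| / (||g|| + wd * ||w||),
    so layers with small gradients relative to their weights take
    proportionally larger steps. Momentum is omitted for brevity.
    """
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    # trust ratio; eps guards against division by zero
    local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)
    return w - base_lr * local_lr * (grad + weight_decay * w)
```

This is only the update rule as I read it from the paper, not a tested training recipe, so I'd still love pointers to a maintained implementation or published comparisons.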