Ohh thanks, that was the dilemma for me. I will decrease the dropout a bit and train for a few more epochs.

# Using use_clr_beta and new plotting tools

**wyquek**(魏璎珞) #42

Hi @sgugger

I’ve been reading your “The 1cycle policy” article. I can understand the high learning rates in the middle of the cycle helping to regularize, and the usual low learning rates at the end of the cycle, but I can’t wrap my head around why a small lr at the beginning of the cycle would help. I can’t find the why for the warmup.

Any way to point me to an article/blog/etc. that explains why using a small lr during warmup helps us later reach super-convergence?

Edit: I think I kinda found the answer in http://teleported.in/posts/cyclic-learning-rate/
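For anyone else puzzling over the shape of the schedule, here’s a minimal sketch of a piecewise-linear 1cycle schedule. This is my own toy version, not fastai’s actual implementation; the names `lr_max`, `div_factor`, and `pct_warmup` are made up for illustration:

```python
def one_cycle_lr(step, total_steps, lr_max, div_factor=25.0, pct_warmup=0.3):
    """Toy piecewise-linear 1cycle schedule (not fastai's exact code):
    warm up from lr_max/div_factor to lr_max, then anneal back down."""
    lr_min = lr_max / div_factor
    warmup_steps = int(total_steps * pct_warmup)
    if step < warmup_steps:
        # warmup leg: small lr -> lr_max
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    # annealing leg: lr_max -> lr_min
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_max - (lr_max - lr_min) * frac
```

The way I understand it, the warmup leg keeps the early updates small while the weights are still essentially random, so that the very high learning rates in the middle of the cycle don’t blow the training up.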

**EinAeffchen**(Leon Dummer) #43

Is there an implementation of LARS, or has someone already run experiments showing that LARS is effective?
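For reference, the core idea of LARS (You et al., “Large Batch Training of Convolutional Networks”) is a layer-wise trust ratio that rescales the global learning rate per layer. A rough sketch of a single update for one layer, ignoring momentum and weight decay; the helper name and the `trust_coef` default are my own:

```python
import numpy as np

def lars_update(w, grad, lr, trust_coef=0.001, eps=1e-9):
    """One simplified LARS step for a single layer: scale the global lr
    by the layer-wise trust ratio ||w|| / ||grad||."""
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    local_lr = trust_coef * w_norm / (g_norm + eps)  # layer-wise trust ratio
    return w - lr * local_lr * grad
```

The point of the trust ratio is that layers whose gradients are large relative to their weights get a proportionally smaller step, which is what supposedly makes very large batch sizes stable.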