Ohh, thanks, that was the dilemma for me. I will decrease the dropouts a bit and train for a few more epochs.
I’ve been reading your “The 1cycle policy” article. I can understand the high learning rates in the middle of the cycle helping to regularize, and the usual low learning rates at the end of the cycle, but I can’t wrap my head around why a small lr at the beginning of the cycle would help. I can’t find the “why” for the warmup.
Any way to point me to an article/blog/etc. that explains why using a small lr during warmup helps us later reach super-convergence?
Edit: I think I kinda found the answer in http://teleported.in/posts/cyclic-learning-rate/
Is there an implementation of LARS, or has someone already run experiments showing LARS to be effective?
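For anyone curious what LARS actually does, here is a minimal sketch of its layer-wise scaling (from the You et al. LARS paper): each layer’s update is scaled by a “trust ratio” of weight norm to gradient norm, so layers whose gradients are small relative to their weights still make progress. Function names and defaults here are illustrative, not taken from any particular library.

```python
import numpy as np

def lars_trust_ratio(weights, grads, eta=0.001, eps=1e-9):
    # LARS trust ratio: eta * ||w|| / ||g|| for one layer.
    # eta is the LARS coefficient; eps avoids division by zero.
    w_norm = np.linalg.norm(weights)
    g_norm = np.linalg.norm(grads)
    return eta * w_norm / (g_norm + eps)

def lars_step(weights, grads, base_lr=0.1):
    # One plain-SGD update where the layer's local lr is the global
    # lr scaled by its trust ratio (momentum/weight decay omitted).
    local_lr = base_lr * lars_trust_ratio(weights, grads)
    return weights - local_lr * grads

w = np.array([3.0, 4.0])   # ||w|| = 5
g = np.array([0.6, 0.8])   # ||g|| = 1
ratio = lars_trust_ratio(w, g)
w_next = lars_step(w, g)
```

With these toy numbers the trust ratio is roughly 0.001 * 5 / 1 = 0.005, so the effective per-layer lr is much smaller than the global one, which is the point: each layer is kept at a step size proportionate to its own scale.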
Thanks @sgugger for keeping people up with your research.
I know this topic is a bit old, but I find it pretty interesting, so I will post my questions here:
- @sgugger Did you end up posting your results somewhere?
- From your experiments with LARS, did you find it gave significantly better results?
- Looking at the official fastai implementation, it seems to do (at least) two things differently from the original paper: 1) it goes down to 1/25 of the max lr instead of 1/10, and 2) it uses cosine annealing. I was wondering if there are other differences from the original paper and, if so, whether anyone could explain why these changes were made. My guess is that they came from empirical experimentation. If so, I’d be interested in seeing a notebook that shows it.
- It seems from your blog article that the idea is to have one cycle over the whole training run, divided into 3 phases (going up, going down, and the end). If so, the length of the cycle depends directly on how many epochs you want to train for. However, I can’t find any good rule for how many epochs to train. In the old days you’d use early stopping, which helped avoid training too long, but I feel it would be dangerous to use such a rule with 1cycle. Any practical advice on this?
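To make the schedule in the third bullet concrete, here is a minimal sketch of a 1cycle-style learning rate: start at max_lr/25, warm up linearly to max_lr, then cosine-anneal back down toward zero. Parameter names (div_factor, pct_warmup) are illustrative, not fastai’s actual API.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, div_factor=25.0, pct_warmup=0.3):
    # 1cycle sketch: linear warmup from max_lr/div_factor to max_lr,
    # then cosine annealing from max_lr down toward ~0.
    warmup_steps = int(total_steps * pct_warmup)
    start_lr = max_lr / div_factor
    if step < warmup_steps:
        # Warmup: a small lr keeps the first updates stable while
        # the weights are still random.
        frac = step / max(1, warmup_steps)
        return start_lr + frac * (max_lr - start_lr)
    # Annealing phase: cosine decay over the remaining steps.
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1 + math.cos(math.pi * frac))

lrs = [one_cycle_lr(s, 100, 1e-2) for s in range(100)]
```

The schedule starts at 1e-2/25, peaks at exactly 1e-2 when warmup ends, and finishes far below where it started, which matches the “going up, going down, and the end” description above.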
We made changes to the original idea of 1cycle to get something that works better across applications. It does come from empirical experiments.
The results on wikitext-2 were published with this blog post, and you can find the scripts here for v0.7. In v1 it would be slightly different, but by adding mixed precision, you can basically train the AWD-LSTM on wikitext-2 in an hour.
As for the length of the cycle, well, that is the million-dollar question. We are looking for a way to determine it, but for now it’s only through various experiments.