One Cycle policy

Aadi_gupta · October 13, 2018, 10:57am

Hi everyone!
For the last few weeks, I have been trying to implement this fabulous paper https://arxiv.org/pdf/1803.09820.pdf by @Leslie in TensorFlow, but I am not able to get the results reported in the paper. I have several doubts about the implementation, since there is no good TensorFlow implementation of it.

1. During the LR range test, which momentum value should we use, since this is the first step of the whole recipe? Should it stay constant at 0.9, or should it decrease linearly as the LR increases?
2. What values of min_lr and max_lr should we use? In the original paper he suggested starting with a very small LR of 1e-6 and increasing it linearly or exponentially until the training loss starts diverging, but the report I referenced above uses only a small LR range of, say, 1e-2 to 1e-1 (similar to the 1cycle policy), so I am wondering how one comes up with the value of 1e-2.
3. To find the optimal momentum, should we use the one cycle policy (max_lr found from step 1 and min_lr one-tenth of max_lr), in which max_mom is the hyperparameter and min_mom is 0.8 or 0.85? One more doubt: should we run the experiment for only half the cycle (in which the LR increases from its minimum to its maximum value) or for the entire cycle?
4. I am also not seeing any clear instability in the validation loss (all hyperparameter values converge to the same low loss value, which suggests an underfitted model). I am attaching a pic of my experiment (a search over dropout). The max LR from my LR range test came out to be 10^(-1.1) ≈ 0.08, so I ran the one cycle policy for only half the cycle. I kindly request you all to look into this doubt.

@jeremy @rachel @Leslie @radek

I am not able to explain a few things to myself. I observe that I can use a much larger learning rate than the one found from the LR range test (that experiment gives me a value of 0.08, but I am able to use an LR of up to 1 while reducing the overall number of training epochs). What, then, is the point of running the LR range test if it does not give me the max LR?

Leslie · October 14, 2018, 12:57pm

Thank you for your interest in my work.

As you likely know from my report, I did my experiments with Caffe. I put some of my original files on GitHub in the lnsmith54/hyperParam1 repository. If you are not using it as a reference, you should.

I’d like to mention that since writing the report I realized that my work is closely related to several papers on large batch training. The first and one of the best papers on this is from Facebook:
Goyal, Priya, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. “Accurate, large minibatch SGD: training imagenet in 1 hour.” arXiv preprint arXiv:1706.02677 (2017).
This paper offers many valuable suggestions for large batch training. The difference is that they focus on the large batch size (BS) and I focused on what is the best LR schedule.

I started using a cyclical momentum with my LR range test, typically in the range of 0.95 → 0.8. I found this empirically. Recently I started some new experiments on finding what is the best LR schedule for large batch size training. BTW, I will likely need to update my report. For example, I discovered a justification for cyclical momentum. The paper:
Smith, Samuel L., Pieter-Jan Kindermans, and Quoc V. Le. “Don’t Decay the Learning Rate, Increase the Batch Size.” arXiv preprint arXiv:1711.00489 (2017).
suggests that g = c * LR / (BS * (1 - m)), where g is the noise level, c is a constant, LR is the learning rate, BS is the batch size, and m is the momentum. If you hold the noise level and batch size fixed and think of it as LR / (1 - m) = c, then increasing LR requires decreasing m. Currently, I have tied momentum to LR using this equation.
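As a rough illustration of tying momentum to LR through LR / (1 - m) = c, the sketch below solves that relation for m at a given LR. The function name and the choice to fix c from a starting pair (LR = 0.1, m = 0.95) are assumptions made for this example; they are not taken from the Caffe files mentioned above.

```python
def momentum_for_lr(lr, base_lr=0.1, base_momentum=0.95):
    """Return the momentum m that keeps lr / (1 - m) at its base-point value.

    base_lr and base_momentum are illustrative assumptions, not values
    prescribed by the thread.
    """
    c = base_lr / (1.0 - base_momentum)  # constant fixed by the starting pair
    m = 1.0 - lr / c                     # solve lr / (1 - m) = c for m
    if m < 0.0:
        raise ValueError("lr is too large for this constant; momentum would be negative")
    return m


for lr in (0.1, 0.2, 0.4):
    print(f"lr={lr:.2f}  momentum={momentum_for_lr(lr):.3f}")
```

Under these assumed starting values, a fourfold increase in LR moves the momentum from 0.95 down to 0.8, which is the same span as the empirical 0.95 → 0.8 range mentioned earlier in this post.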

My experiments are also showing that LR * WD = c (approximately). This implies that as LR increases during 1Cycle, WD should decrease. My current version of 1Cycle includes a cyclical WD. My experiments show that doing this allows me to achieve near the state-of-the-art (SOTA) results while not doing this does not. This result is better than with cyclical momentum and, IMO, this could be a key factor in large BS and large LR training.
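A minimal sketch of a cyclical WD that follows LR * WD = c is shown below. It uses a plain triangular LR schedule as a stand-in for 1Cycle (the final low-LR phase described in the report is left out), and the function names, the LR range of 0.1 to 1.0, and the starting WD of 5e-4 are all assumptions for illustration only.

```python
def one_cycle_lr(step, total_steps, min_lr=0.1, max_lr=1.0):
    """Triangular schedule: linear ramp up for the first half, linear ramp down after."""
    half = total_steps / 2.0
    frac = step / half if step <= half else (total_steps - step) / half
    return min_lr + frac * (max_lr - min_lr)


def weight_decay_for_lr(lr, base_lr=0.1, base_wd=5e-4):
    """Return the WD that keeps lr * wd equal to base_lr * base_wd (assumed values)."""
    c = base_lr * base_wd
    return c / lr


total_steps = 10
for step in range(total_steps + 1):
    lr = one_cycle_lr(step, total_steps)
    wd = weight_decay_for_lr(lr)
    print(f"step={step:2d}  lr={lr:.3f}  wd={wd:.2e}")
```

With this pairing the WD is smallest at the peak LR and returns to its starting value at the end of the cycle.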

BTW, I am running my latest experiments with ShakeDrop from the imenurok/ShakeDrop repository on GitHub, a Torch implementation of the paper "ShakeDrop regularization" (https://arxiv.org/abs/1802.02375), because it achieves near SOTA for CIFAR-10/100. I am also planning to confirm my results in Caffe and PyTorch.

Early in my experimentation I started the LR range test (LRR) at 0 but found that when the LR is too small, there was some underfitting. I showed an example in my report. After that, I would start the LRR at 0.1 (sometimes 0.01) to skip the unnecessary underfitting.
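For readers implementing the range test themselves, the sketch below generates the per-step LR values for a test that starts at 0.1 instead of 0. The function name, the number of steps, and the end value of 3.0 are assumptions for illustration; both the linear and exponential ramps are included because the question above mentions both.

```python
def lr_range_schedule(start_lr, end_lr, num_steps, mode="linear"):
    """Return a list of learning rates rising from start_lr to end_lr.

    mode="linear" uses equal additive steps; mode="exponential" uses equal
    multiplicative steps and therefore needs start_lr > 0.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if mode == "linear":
        return [start_lr + (end_lr - start_lr) * i / (num_steps - 1)
                for i in range(num_steps)]
    if mode == "exponential":
        if start_lr <= 0:
            raise ValueError("exponential mode needs start_lr > 0")
        ratio = end_lr / start_lr
        return [start_lr * ratio ** (i / (num_steps - 1))
                for i in range(num_steps)]
    raise ValueError(f"unknown mode: {mode}")


# Illustrative values only: start at 0.1 to skip the low-LR underfitting region.
schedule = lr_range_schedule(0.1, 3.0, num_steps=5, mode="exponential")
print([round(lr, 4) for lr in schedule])
```

Each value would be applied for one training step (or a few), with the training loss recorded so that the point where it starts to diverge can be read off afterwards.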

I don’t know why your LR range test indicates a maxLR of 0.08. It could be the model/architecture you are using. When I use a ResNet with BN, I can typically use a range from 0.1 → 3. MaxLR depends on the BS. Or, as my tests now show, maxLR depends on BS, WD, and m. It could be any of these factors.

Best,
Leslie

Aadi_gupta · October 14, 2018, 4:13pm

Thanks a lot Leslie!
Just one quick question: do you apply weight decay to all the trainable variables (including the batch norm parameters and biases)?

Leslie · October 14, 2018, 4:30pm

As a general rule, I download the code I want to replicate from GitHub (e.g., ShakeDrop) and run it the way it was implemented. The large batch training literature recommends not using WD on BN, so if you are asking what you should do, don’t apply WD to BN.
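One framework-agnostic way to act on this is to split the trainable variables by name and add the WD penalty only for the non-BN group. The sketch below is an assumption-laden illustration: the function name, the substring markers, and the example variable names are hypothetical, and real variable names depend on how the model was built. Biases stay in the decayed group here because the reply above addresses BN only.

```python
def split_for_weight_decay(param_names, bn_markers=("bn", "batch_norm", "batchnorm")):
    """Split variable names into (decay, no_decay) lists.

    A name goes to no_decay if it contains any BN marker (case-insensitive).
    Substring matching is crude and may need adjusting for a given model.
    """
    decay, no_decay = [], []
    for name in param_names:
        lowered = name.lower()
        if any(marker in lowered for marker in bn_markers):
            no_decay.append(name)
        else:
            decay.append(name)
    return decay, no_decay


# Hypothetical variable names, for illustration only.
names = ["conv1/kernel", "conv1/bn/gamma", "conv1/bn/beta", "dense/kernel", "dense/bias"]
decay, no_decay = split_for_weight_decay(names)
print("apply WD to:", decay)
print("skip WD for:", no_decay)
```

The L2 penalty (or the optimizer's decay setting) would then be applied only to the variables in the first list.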