Cyclical Layer Learning Rates (a research question)

That makes sense. I think @sgugger’s training phase library is probably a good starting point. It already does discriminative learning rates (@jeremy’s name for linearly scaling the learning rate per layer group), and it shouldn’t be too much work to modify it to allow scaling that over time, or to apply different learning rate schedules per layer. I think it’s worth working towards the latter even if the first thing we decide to test is scaling a global learning rate linearly over time.
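For anyone who hasn’t seen discriminative learning rates in code, here is a minimal sketch of the idea using plain PyTorch parameter groups (the linear scaling, group split, and function name are illustrative assumptions, not @sgugger’s library):

```python
# A hedged sketch of discriminative (per-layer-group) learning rates using
# plain PyTorch parameter groups; names and the scaling are illustrative.
import torch

def discriminative_lr_param_groups(layer_groups, max_lr):
    """Scale the LR linearly across layer groups: the earliest group gets the
    smallest LR, the last group gets max_lr."""
    n = len(layer_groups)
    return [{"params": g.parameters(), "lr": max_lr * (i + 1) / n}
            for i, g in enumerate(layer_groups)]

# e.g. optimizer = torch.optim.SGD(
#          discriminative_lr_param_groups(groups, max_lr=0.1), momentum=0.9)
```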

Do you have any thoughts about the architecture(s) and datasets that you think would be worth exploring with this?

I suggest starting with quick and simple tests. I often start with MNIST and CIFAR with a few-layer network (e.g., LeNet). Then I would try a resnet with CIFAR to see how changing to a deeper architecture with skip connections works. The main thing is to learn from every experiment. If it doesn’t work, one can sort it out with quick experiments. If it does work, we should eventually get to ImageNet.

Do you know which experiments to try? We discussed several in this thread.

@Leslie you may be interested to read the new research results from @sgugger that show that Adam and super convergence work well together, if you use the AdamW variant:

http://www.fast.ai/2018/07/02/adam-weight-decay/

Also, be sure to check out LARS, which has a nice approach to layer-wise LR:

@sgugger has done some initial research into LARS and super-convergence. I’m not sure where he’s up to with it so far.


Thanks for the pointers @jeremy. I am looking forward to reading both.

For LARS and super-convergence I have mixed results: sometimes it works beautifully, sometimes it crashes the training completely.
For now I’ve only been able to make it work on CIFAR-10, not other tasks.

Have you tried reducing weight decay? I found it necessary to reduce other forms of regularization.

I am currently reading the LARS paper. I was unaware of its existence, even after my literature search (how did you and Jeremy find it?). It might well make this thread obsolete - if LARS is a better method than the layer-wise learning rate I suggested here. Do you have an opinion? Is LARS better than a manual or cyclical layer-wise learning rate approach?


One more question: have you looked at the values of the layer learning rates for each layer? I don’t see that in the paper. My suggestions above for the LLR were based on intuition, so I’d like experimental evidence that confirms or disproves that intuition.

Now that I’ve read the LARS paper I can say that this thread/research project is still worthwhile pursuing - maybe even more so than before.

My first question is: has anyone, particularly @sgugger, replicated the results in the LARS paper? That paper shows results for ImageNet, using a modified AlexNet with batch norm. But @sgugger said above that his results with LARS and super-convergence have been mixed, and that so far he has only gotten it to work on CIFAR-10.

For anyone who hasn’t read the LARS paper, it suggests a layer-wise learning rate computed as:

lambda^l = eta * norm(w^l) / norm(gradient^l)

where eta is essentially a constant learning rate and the superscript ‘l’ means layer-wise. This makes it a layer-wise adaptive LR.
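As a concrete illustration, a rough sketch of that layer-wise LR in PyTorch might look like this (eta, eps, and the plain-SGD update loop are my own illustrative choices, not the paper’s reference code):

```python
# A minimal sketch of the LARS layer-wise learning rate described above;
# the eta and eps values are illustrative assumptions.
import torch

def lars_layer_lr(weight, grad, eta=0.001, eps=1e-8):
    """lambda^l = eta * norm(w^l) / norm(gradient^l) for a single layer."""
    return eta * weight.norm() / (grad.norm() + eps)

# Usage inside a training loop, after loss.backward():
# with torch.no_grad():
#     for p in model.parameters():
#         if p.grad is not None:
#             p.add_(p.grad, alpha=-float(lars_layer_lr(p, p.grad)))
```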

However, I immediately saw what I believe is an improvement to this. Based on theory, I think a better LLR is:
lambda^l = norm(MADw^l) / norm(MADgradient^l)
where MAD means a moving average of the difference over iterations. IMO, it is the change in the weights and the change in the gradients that estimate the second derivative (i.e., the Hessian), which indicates the local curvature and hence the learning rate. Is it clear why the rate of change of the weights and the gradients is what matters (think of the definition of a derivative in calculus)? The moving average is there to smooth out the noise in the gradients. It would be informative to compare my version to LARS.
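To make the proposal concrete, here is a hedged sketch of how the MAD-based LLR could be tracked per layer (the per-layer state dict, beta, and eps are my own assumptions for illustration, not something taken from a paper or library):

```python
# A sketch of the moving-average-of-differences (MAD) layer-wise LR above;
# the smoothing factor beta and the state bookkeeping are illustrative.
import torch

def mad_layer_lr(state, weight, grad, beta=0.9, eps=1e-8):
    """lambda^l = norm(MADw^l) / norm(MADgradient^l), where MAD is an
    exponential moving average of the iteration-to-iteration change."""
    if "prev_w" not in state:
        state.update(prev_w=weight.clone(), prev_g=grad.clone(),
                     mad_w=torch.zeros_like(weight),
                     mad_g=torch.zeros_like(grad))
        return None  # no estimate available on the first iteration
    state["mad_w"].mul_(beta).add_(weight - state["prev_w"], alpha=1 - beta)
    state["mad_g"].mul_(beta).add_(grad - state["prev_g"], alpha=1 - beta)
    state["prev_w"].copy_(weight)
    state["prev_g"].copy_(grad)
    return state["mad_w"].norm() / (state["mad_g"].norm() + eps)
```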

Also, I was dismayed that the LARS paper doesn’t seem to compare against Adam (or the AdamW variant discussed in the latest blog post at http://www.fast.ai/2018/07/02/adam-weight-decay/). Not comparing the method more thoroughly weakens the paper.

Coming back to the topic of this post, I’d say it is worthwhile to start with manually setting LLR and adding a few more experiments. Obviously, we should compare to LARS. Also, we should compare manual setting to my version above. In addition, AdamW needs to be part of the experiments.

Finally, I’d like to say that I started this LLR thread for educational purposes for any of the fast.ai students who would like to experience my version of doing research (i.e., the thought experiment, searching the literature, designing and running experiments, observing and trying to understand the results, and perhaps writing a paper). For that reason, I’d like to continue this LLR “lesson” in the public forum. Is this interesting to anyone? Should we continue?

Best,
Leslie


I haven’t tried it on ImageNet yet (it’s slower than experimenting on CIFAR-10 so I always begin there). I’ll run experiments with your suggestion and report the results here.

Thank you @sgugger - I’d be quite interested in the results of your experiments with my suggestion for layer-wise learning rates. And please do share your results with all.

FYI: I’ve written to one of the LARS paper’s authors, Boris Ginsburg, with a few questions. In his reply, he attached a revised version of the algorithm in a draft of a paper that is called “Layer-wise Adaptive Rate Control for Training of Deep Networks” (I don’t feel I can take the liberty of sharing this draft but if you email Boris, I expect he will send it to you). This new version is called LARC. LARC differs from LARS in a couple of small ways:

  1. the global LR is clipped if it exceeds the smaller layer learning rate,
  2. when the global LR is small (near the end of training), it switches to SGD.

If you want to see the code, he sent the following links: “LARC is supported in NVIDIA caffe: https://github.com/NVIDIA/caffe/blob/caffe-0.17/src/caffe/solvers/sgd_solver.cpp and in Tensorflow: https://github.com/NVIDIA/OpenSeq2Seq/blob/master/open_seq2seq/optimizers/optimizers.py .”
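Based purely on the two differences described above (not on the NVIDIA code), the LARC clipping idea might reduce to something like this sketch:

```python
# A rough sketch of the LARC "clipping" idea as described above; the trust
# coefficient and eps are illustrative assumptions, not NVIDIA's values.
def larc_lr(weight, grad, global_lr, trust_coeff=0.02, eps=1e-8):
    """Use the smaller of the LARS-style layer LR and the global LR, so that
    near the end of training (small global LR) this reduces to plain SGD."""
    local_lr = trust_coeff * float(weight.norm()) / (float(grad.norm()) + eps)
    return min(local_lr, global_lr)
```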

From these two changes, I suspect there was a problem with instability (i.e., exploding gradients) that they are solving.

Allow me to make a tangential comment: IMO, the best training speed and generalization are obtained by remaining just below the instability line. I think the 1cycle LR policy works well because convergence moves the stability line, and the increasing LR pushes the training closer to this dynamic line. I’ve run experiments where the stepsize for 1cycle was too short and found it diverges. I’ve wondered if it might be better to run a couple of tests to find the minimum stepsize, use a slightly larger stepsize for the increasing LR, and then decrease the LR more slowly, as in @jeremy’s slanted triangular learning rate. It might be interesting for someone to test. Please let me know if anyone tries this.
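For concreteness, here is a toy sketch of a 1cycle-style schedule with separate increasing and decreasing step sizes (the function and parameter names are illustrative, not the fastai implementation):

```python
# A toy piecewise-linear 1cycle-style schedule with separate step sizes for
# the increasing and decreasing phases; all names here are illustrative.
def one_cycle_lr(it, lr_min, lr_max, step_up, step_down):
    if it < step_up:                          # increasing phase
        return lr_min + (lr_max - lr_min) * it / step_up
    if it < step_up + step_down:              # decreasing phase
        return lr_max - (lr_max - lr_min) * (it - step_up) / step_down
    return lr_min                             # any remaining iterations
```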

Best,
Leslie

I have experimented a lot around this, and the result I found is that it’s always better to spend equal time increasing and decreasing, under the same budget in terms of epochs. The only variant I found that worked slightly better, when it was possible to increase the learning rate faster, was to hold a constant plateau at the highest learning rate (with the momentum held at its minimum value) before decreasing (that’s the last figure in the blog post).
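A toy sketch of that plateau variant (the schedule shape and the step_flat name are illustrative, not the fastai implementation):

```python
# A toy schedule: rise to lr_max, hold it for step_flat iterations (with the
# momentum held at its minimum elsewhere in the training loop), then decrease.
def one_cycle_plateau_lr(it, lr_min, lr_max, step_up, step_flat, step_down):
    if it < step_up:                          # increasing phase
        return lr_min + (lr_max - lr_min) * it / step_up
    if it < step_up + step_flat:              # plateau at the highest LR
        return lr_max
    if it < step_up + step_flat + step_down:  # decreasing phase
        return lr_max - (lr_max - lr_min) * (it - step_up - step_flat) / step_down
    return lr_min
```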
My intuition is that exploring the loss function with large learning rates for a long time leads to finding a flatter minimum in the end, one that generalizes better (even if we don’t see the benefits of this exploration while it happens, since the validation loss doesn’t move much).

Also, for the problem of exploding gradients, gradient clipping has worked pretty well with language models, and I’m sure it could also be applied in other situations to push this stability line a bit further.

I think continuing in the open is a great idea. This discussion adds a lot of value, is very educational, and its depth is beyond nearly anything I have come across on the Internet. As a completely tangential thought - there seems to be something special about these forums; hopefully we can keep them that way :slight_smile: .

Gradient clipping has been the secret formula for me for transfer learning with CNNs (especially resnets) and the 1cycle policy. This sounds to me like a very interesting angle to continue to explore. The value I used (partially based on plots produced in some other thread by @sgugger) was 1e-1 for the clipping, and it allowed me to train with absurdly high learning rates relative to the batch size.
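For reference, a minimal sketch of what that clipping looks like in a PyTorch training loop (clip_grad_norm_ is the standard PyTorch utility; the 0.1 value is the one mentioned above and is dataset/architecture dependent):

```python
# One optimization step with gradient-norm clipping applied before the update.
import torch

def training_step(model, batch_loss, optimizer, clip=0.1):
    optimizer.zero_grad()
    batch_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # cap grad norm
    optimizer.step()
```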

The meaning of instability here is important (as in: how do we define it? Is it a point beyond which a model starts to diverge? If so, maybe crossing the instability line is not that bad as long as you use gradient clipping). I have had the best results when pushing the training to its limits - using a learning rate so high that the loss just before the peak of the cycle goes haywire. That, I think, is where the clipping comes into play.

I am sorry if this all sounds very wishy-washy - it’s nothing more than intuition at this point that I have built up while working with the 1cycle policy. But I’ll see if there is anything more tangible I can come up with - I just started playing with AdamW as described in the post by Sylvain and Jeremy. This is a really big development imho - can’t wait to start using it :slight_smile:

Leslie - I never got a chance to let you know and thank you properly, but I used your 1cycle policy to win a Kaggle competition not too long ago. Here is a write-up on the approach, and here are slides that I think may have been presented at the FGVC5 workshop at CVPR.

I feel the 1cycle policy doesn’t only give one an edge; it also increases one’s ability to actually do the training with limited hardware. Having the option to use the LR as a regularizer, and internalizing the notion that training with a non-decreasing test loss is something you might want to do (and why that might be), are beyond helpful and have changed how I train NNs dramatically. Thank you very much for sharing your work and the thought processes that go into making sense of what happens during training.


Please, do continue! I enjoy every bit of this type of discussion, even though I cannot fully comprehend every single detail.

Thank you for sharing this. I’ve wondered about this since seeing the slanted triangular LR in @jeremy’s paper but haven’t had the time to test it myself.

Did you directly compare with and without the constant plateau? And is this with the same budget of epochs?

I have used clipping with CNNs and found it helps prevent divergence.

I’d like to thank @radek for your comments. It encourages me to continue to devote the time and effort to this educational “experiment”.

I too have found gradient clipping to help avoid divergence and allow the use of larger LRs. However, I’ve primarily used larger clipping values (1 - 20), depending on the problem. I suspect that clipping at 0.1 would slow down the convergence, but I don’t have evidence; this would call for an experiment to decide. Also, it likely depends on the dataset and architecture.

Yes, I do mean instability as causing divergence. In my experience, once the training starts to diverge, it dies swiftly. Strong clipping will keep it from diverging, but again, too small a clipping value will slow the training.

I too am interested in experimenting with AdamW since seeing the new post. I saw the paper when it first appeared but wasn’t motivated to try it until now. Their work is good at creating new motivation.

I was at CVPR and was glad to see Anthony Goldbloom (CEO of Kaggle) discuss both fastai’s win in the DAWNBench competition and your fine achievement in winning Kaggle’s iMaterialist challenge. I can’t express how gratifying it is that my work is getting so much attention due to fast.ai, for which I am obliged.

I hope you can keep up your good work and I wish you every success.


Thank you very much for your very kind words Leslie :slight_smile:

Ah, I think I found the missing piece in my understanding! By divergence you are referring to divergence on the train set? A divergence would then be defined as a phenomenon in which the loss increases by some substantial amount on the train set?

The reason I bring this up is that the below is a very common pattern I see when training with gradient clipping and the 1cycle policy (this is from the fastai DAWNBench result I reproduced on my 1080ti):
[training/validation loss plot]

The training never diverges and stays below the instability line on the train set? (I am not sure if my understanding of divergence and the instability line is correct, in that they refer to results on the train set.)

As a side observation, I wonder what to make of the varying loss on the val set :thinking: The jumps are sometimes even more extreme, but in the end the solution generalizes to the val set and all is well. Some time ago I would have assumed that having high bias (not fitting the training set all that well) would mean there should be a small generalization gap - we should be able to fit the val set also not really well, but the performance should be comparable. My recent experience, though, suggests that this intuition might be wrong - underfitting the val set doesn’t really guarantee we are in a low-variance regime (the lower the variance, the better the generalization performance should be?).

Sorry - don’t mean to derail the discussion :slight_smile: Will continue to experiment with AdamW and read the conversation in this thread to see if there is anything that I can contribute.


The short answer is yes. Perhaps I am making up a name for when I’ve seen this happen.

I’ve always considered this exploration of the loss landscape. As an analogy, in Deep Reinforcement Learning (DRL), exploration can produce a worse result short term than if it only followed a greedy path but in the end, a better result is found.

I am unsure I follow. My thinking on underfitting vs. overfitting is laid out in Section 3 of my report at https://arxiv.org/pdf/1803.09820.pdf. Regularization often (but not always) hurts the training performance but improves the validation performance. I’ve learned to look for a balance.

If you see anything interesting with your experiments on AdamW, please report them in this thread. Thanks.


Yes, it’s with the same number of epochs (I was using 90 for the training of RNNs), and it’s compared to a regular 1cycle. The gain is not huge, but still tangible (roughly one point of perplexity if I recall correctly). This doesn’t always work, and it is another, stronger form of regularization in itself. Adding other types of regularization can make it perform worse than the regular 1cycle.

I guess it depends on the situation, but with the learning rate rising at the beginning, I saw in my experiments that most of the gradients quickly drop (which is probably why the high learning rates work and help) and very few are actually clipped. Empirically, I’ve rarely seen clipping hurt the final result, quite the opposite.

This is typical of super-convergence, and I’ve found it’s the sign that you are right at the limit before the training diverges. Usually, with a lower learning rate, you’ll get less shaking, but the final result may be a bit worse too. It also depends on the architecture: a resnet shakes like this, but RNNs rarely do as much.


I am definitely enjoying reading the discussion in this type of setting. It is interesting to see the different theories and intuitions shared in this public forum. It gives everybody an opportunity to share ideas and also allows anybody to test the theories that get suggested. I don’t have any ideas to contribute at this point, but I’ve definitely learned a ton from this thread.


Good to know. It makes sense that it is a stronger regularization because the large LR is maintained longer.

This matches my experience, but if the clipping value is too low, I expect clipping will hurt. Certainly, in the limit as the clipping value goes to 0, training will slow and then stop happening altogether.

As an aside, I recall seeing a large variance/noise when training DenseNet with super-convergence. When I reduced the regularization, the noise decreased and the performance increased. This was one of several clues where I learned the importance of reducing other forms of regularization (other than large LR) to reach an optimal balance.