# Cyclical Layer Learning Rates (a research question)

(Leslie N. Smith) #1

All,

I recently read and was unimpressed by the paper “On layer level control of DNN training and its impact on generalization” arxiv:1806.01603. In spite of the authors’ claims, their results implied to me that it is best to use uniform, constant layer level learning rates, as is the common practice.

The good news is that it reminded me that for years I’ve wanted to investigate (cyclical) layer level control of learning rates but I haven’t had the time. But if anyone here is also curious about this, we can look into this together.

The first step is a thought experiment; that is, to write down what we think we will find when we run the experiments, why we think it is so, and what experiments will show it. After we are clear about our expectations and why, we can run the experiments, which will likely cause us to revise our thinking. So, before running any experiments, reply to this post with your expectations and reasons.

I will start.

First, I expect that changes to the layer learning rates (LLR) will only affect the training speed but not the generalization or accuracy of the final network. I can think of one reason why uniform, constant LLR might be best - because we are solving for the weights throughout the network (meaning they are interdependent), which is like solving a set of simultaneous equations. If so, one should solve for all of them together.

But I think it is more likely that one should start training with larger LLR in the first layers and smaller LLR in the last layers. Furthermore, near the end of the training the layer learning rates should be reversed; that is, with smaller LLR in the first layers and larger LLR in the last layers.

I can think of 3 reasons for my belief/intuition. First, changes in the first layer’s weights require changes in all the subsequent layers’ weights, so until the first layer’s weights are approximately correct, there’s little value in trying to get the subsequent layers’ weights correct. Hence, increase the LLR of the first layers. Second, a decade ago unsupervised pretraining was a common technique for training networks. The method was to set up the first layer as an autoencoder (AE) and train the weights so as to reconstruct the input. The next step was to add a second layer, fix the first layer’s weights, and compute the second layer’s weights as an AE to reconstruct the output from the first layer. One repeats this for every layer in the network. It is clear that this method implies training the layers one at a time. In my mind, dynamic LLR copies this idea. Finally, vanishing gradients are a known difficulty in training networks and they most affect the training of the first layers. Hence, larger learning rates for the first layers in the beginning of training should help. Once those layers are approximately correct, one can lower those LLRs.

As for starting training with smaller LLR in the first layers and larger LLR in the last layers, I think this would slow down the training. Of course, experiments could show the reverse, in which case I’d have to understand why.

As for experiments, there are several factors to consider, such as datasets, architectures, LR policy, and how the LLR should vary from layer to layer and during the training. I prefer starting simply in order to find and fix problems, with the plan to eventually perform a comprehensive set of experiments. One possibility for simple experiments is a shallow network on CIFAR-10, moving later to deeper networks and larger datasets. I’d start with a piecewise constant global learning rate that is multiplied by 0.1 at the 50%, 75%, and 90% points of training. Later I’d try CLR. I’d vary the layer learning rates linearly from the first to the last layers. Perhaps a first pass might set the first layer’s learning rate to 1.5 and the last layer’s learning rate to 0.5, but these are complete guesses on my part and this will require experimentation. The LLR can change linearly over the course of the training, but some other schedule could be better, so this too must be tested.
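A minimal sketch of the starting setup described above may help make it concrete. It assumes that the LLR acts as a per-layer multiplier on the global learning rate and that each drop multiplies the global rate by 0.1; the function names and the base rate of 0.1 are hypothetical, chosen only for illustration.

```python
def global_lr(progress, base_lr=0.1, milestones=(0.5, 0.75, 0.9), factor=0.1):
    """Piecewise-constant global LR.

    `progress` is the fraction of training completed, in [0, 1].
    The rate is multiplied by `factor` once for every milestone already passed.
    `base_lr` is an assumed value, not one proposed in the thread.
    """
    drops = sum(1 for m in milestones if progress >= m)
    return base_lr * factor ** drops


def layer_multipliers(num_layers, first=1.5, last=0.5):
    """LLR multipliers that vary linearly with depth, first layer to last."""
    if num_layers == 1:
        return [first]
    step = (last - first) / (num_layers - 1)
    return [first + i * step for i in range(num_layers)]


def effective_lrs(progress, num_layers):
    """Per-layer learning rate = global LR times that layer's multiplier."""
    g = global_lr(progress)
    return [g * m for m in layer_multipliers(num_layers)]


if __name__ == "__main__":
    for progress in (0.0, 0.6, 0.8, 0.95):
        print(progress, effective_lrs(progress, num_layers=5))
```

In a real framework these per-layer values would be passed to the optimizer as separate parameter groups; how the multipliers should themselves change during training is left open here, as in the post.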

These are my first thoughts on this. What can YOU add to this?

Best regards,
Leslie

(Alex) #2

Hi Leslie,
Apologies if this deviates from your initial thought process, but what if we try to learn the learning rate schedules for the layers during the training process itself?
Similarly to “attention”.
I.e. learn a learning rate scaling factor for each layer from a given global learning rate.
The LLR scaling factor for each layer would be an output of a small NN or RNN.
If collected across many epochs, these factors may give an idea of an optimal LLR function over time.
It most certainly will not speed up the training compared to a “handcrafted” LLR schedule,
but (if it works) it would be a highly universal approach to the question of layer learning rate schedules.

(Leslie N. Smith) #3

It does deviate from the topic but that too is fine. What you are suggesting sounds like adaptive learning rates, and there has been substantial research on adaptive learning rate methods, which compute a learning rate per neuron during training. I believe Adam is the current favorite. If you would like to explore a new method, I discuss a method in my super-convergence paper (https://arxiv.org/abs/1708.07120) that has not yet been explored - basically using the average of the weight changes to estimate the Hessian, which is used to compute an optimal LR. This too is an available research topic.

(Even Oldridge) #4

My intuition from transfer learning is that earlier layers in the network will converge more quickly and that we’ll want to reduce the learning rates differently as a function of depth. You hit on a similar concept in your second paragraph from a slightly different angle, and I think we’ll need to consider both concepts; i.e. both the initial value of LLR and the rate of decay.

The extreme of this concept is the freezing of earlier layers which is commonly used in transfer learning, but I think I remember a paper where earlier layers are frozen at various points throughout training.

If memory serves, @jeremy explored a concept similar to what you’re mentioning, albeit without the cycles, using different learning rates for different layers of VGG when transfer learning in one of the examples mentioned in class, in order to prevent earlier layers from changing too much while allowing them some flexibility.

This is a very interesting concept and I’ll be following the thread closely. Thanks for sharing your research ideas @Leslie.

(Leslie N. Smith) #5

Yes, both the initial layer’s learning rate and the rate of change are to be investigated. It sounds like you agree with my intuition that one should start training with LLR > 1 in the first layers so they converge quickly, then reduce the LLR as the training proceeds, leading towards a soft freezing of the earlier layers.

The other part of my proposal here is to set the LLR < 1 for the last layers and increase those layers’ LR as the training proceeds. In addition, there is the question of how to set the LLR for the layers in between. IMO, this is interesting and could result in a new, faster method to train networks.

I would appreciate it if you can point me to where Jeremy explored this concept.

(Even Oldridge) #6

It’s funny, I was just re-listening to cs231n and the description of early neural nets used almost exactly this method, training each layer of the network consecutively before linking them, but I believe that was primarily to help with convergence.

Unfortunately I think it was a one off comment in one of the lectures, and I don’t honestly recall which one. If I recall he’d only explored the idea briefly and he didn’t spend more than a minute on the topic. Hopefully he can comment here on his investigation. If anyone else remembers where that was in the lectures please chime in.

(Scott Mueller) #7

If I’m understanding the topic correctly, I believe the discussion is Discriminative + 1Cycle. It is about 8:53 of Lesson 13. I use hiromi’s wonderful posts tracking Jeremy’s words with corresponding pictures from video, https://medium.com/@hiromi_suenaga/deep-learning-2-part-2-lesson-13-43454b21a5d0. I can read the course when I can’t listen to it.

(Leslie N. Smith) #8

This is known as unsupervised pretraining. This was my second reason given in my first comment of this post.

Thank you for the pointer. Yes, I believe you are right. Jeremy’s discussion of discriminative learning rates is part of this - he suggests fine-tuning the last layer’s learning rate and setting all the other layers’ LRs to a fraction of the last one. I wonder how he came up with his ratio of 2.6.

It would be interesting to start training from scratch with a larger LR for the first layer and letting it decrease with depth, which is the opposite of the formula for the discriminative LR. Then the LRs would change during training and end up similar to the discriminative LR formula.
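To make the two profiles concrete, here is a rough sketch. It assumes the discriminative formula is geometric, with each layer’s LR equal to the next layer’s LR divided by 2.6, and it assumes a simple linear blend between the reversed profile and the discriminative one over training. The function names and the blend are hypothetical illustrations, not something taken from the lesson.

```python
def discriminative_lrs(last_lr, num_layers, ratio=2.6):
    """Last layer gets `last_lr`; each earlier layer gets the next layer's LR / ratio."""
    return [last_lr / ratio ** (num_layers - 1 - i) for i in range(num_layers)]


def reversed_lrs(first_lr, num_layers, ratio=2.6):
    """Opposite profile: the first layer gets the largest LR, decreasing with depth."""
    return [first_lr / ratio ** i for i in range(num_layers)]


def blend(start, end, t):
    """Linear blend of two per-layer LR lists; t=0 gives `start`, t=1 gives `end`."""
    return [(1 - t) * s + t * e for s, e in zip(start, end)]


if __name__ == "__main__":
    start = reversed_lrs(0.01, num_layers=4)       # large LR in the first layer
    end = discriminative_lrs(0.01, num_layers=4)   # large LR in the last layer
    for t in (0.0, 0.5, 1.0):                      # fraction of training completed
        print(t, blend(start, end, t))
```

Whether the transition should be linear, and whether the same ratio is appropriate when training from scratch, are exactly the kind of questions the experiments would need to answer.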

(Martin Benson) #9

Interesting topic! I have a couple of quick thoughts:

1. My suspicion is that for plain feedforward networks layerwise variation like this may not help. The reason for thinking that is that poorly optimised weights at any depth scramble any signal in the inputs that might exist. I don’t see much point in optimising early layers only to send the activations through a shredder later on. There’s a similar argument for not trying too hard on later layers when the early layers are not already somewhat optimised.
2. The argument in (1) does not hold for resnet-type architectures. There, my intuition is that starting with high learning rates in the early layers and then gradually increasing them through the later layers would yield quicker progress than keeping the rate the same over all layers throughout, for reasons similar to why layerwise unsupervised pre-training seemed to help, and also by analogy with gradient boosting. Perhaps even the most extreme version, where you unfreeze one layer at a time, would be the best strategy? I expect that, symmetrically, a strategy of having high learning rates in the last layer and then gradually increasing them back through earlier layers would be equally effective.
3. Given that adaptive learning rates are typically more effective than constant ones, I wonder if we can gain some insight to good strategy by inspecting the distributions of step-sizes per layer for say Adam and seeing how they evolve over time?
4. They’re all first impressions so could be way off!

I’d be very happy to help with researching this further - get back to me if I can help!

(Leslie N. Smith) #10

Wonderful! The first step in research is thinking. Thank you @martinbenson for sharing your thoughts. Now I will share mine and we can go forward with some experiments.

In my first post above I gave a similar argument but my belief is not like this. It is good that our opinions differ because an experiment is now required to help us to know what is in fact true.

Your suggestion that the results might well differ depending on the architecture is valid. This implies the need for experiments with both ordinary and resnet architectures. Another question you raise is how to set the layer learning rates relative to depth. One can use a constant LLR for the lower layers, another constant LLR for the middle layers, and another constant LLR for the later layers. Or let the LLR vary linearly as done in the discriminative LR, which is what I am suggesting for starters.

IMO, unfreezing one layer at a time is this idea taken to an extreme. I suggest a soft freezing/unfreezing but only experiments will reveal the truth of which is best.

You could inspect the stepsize distributions in Adam but I have my reservations. In my super-convergence paper (see https://arxiv.org/abs/1708.07120) I tested Adam and other adaptive LR methods but they did not indicate the use of larger LR. However, in that paper I describe a simple method to estimate the Hessian from the past average changes in the weights (assuming the Hessian changes slowly) and it did indicate the use of large LR. What I recommend is adapting my method to layer by layer average changes in the weights to estimate the LLR. You could try both for comparison.
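As a rough illustration of what a layer-by-layer version might look like, the sketch below uses a secant approximation under plain SGD. Since `w[i+1] = w[i] - lr * grad(w[i])`, the gradients can be recovered from the weight changes, which gives the estimate `lr * (w1 - w0) / (2*w1 - w0 - w2)` from three consecutive weight snapshots. Aggregating over a layer by taking the ratio of norms, using three raw snapshots rather than running averages, and the function names are all assumptions made here for illustration; this is not presented as the exact procedure from the paper.

```python
import math


def estimate_layer_lr(w_prev, w_curr, w_next, lr, eps=1e-12):
    """Estimate a learning rate for one layer from three consecutive SGD iterates.

    Each argument is a flat list of that layer's weights. The numerator is the
    norm of the latest weight change; the denominator is the norm of the change
    in the weight change (a finite-difference stand-in for the curvature).
    """
    num = math.sqrt(sum((c - p) ** 2 for p, c in zip(w_prev, w_curr)))
    den = math.sqrt(
        sum((2 * c - p - n) ** 2 for p, c, n in zip(w_prev, w_curr, w_next))
    )
    return lr * num / (den + eps)


def sgd_on_quadratic(weights, curvature, lr):
    """One plain SGD step on the toy loss f(w) = 0.5 * curvature * sum(w_i ** 2)."""
    return [w - lr * curvature * w for w in weights]


if __name__ == "__main__":
    lr, curvature = 0.05, 4.0
    w0 = [1.0, -2.0]
    w1 = sgd_on_quadratic(w0, curvature, lr)
    w2 = sgd_on_quadratic(w1, curvature, lr)
    # For this toy quadratic the estimate should be close to 1 / curvature.
    print(estimate_layer_lr(w0, w1, w2, lr))
```

On a real network the weight changes are noisy, so one would presumably smooth them with running averages (as the post suggests) and assume the curvature changes slowly between snapshots.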

If you are willing to run experiments to test this, I’d be happy to work with you. Of course, we should share the results on this forum. Perhaps this will lead to a publishable paper.

Best,
Leslie

Amazing thread. I got to know the interesting methodology of doing a thought experiment. I will share my thoughts as well.

1. Changes to LLR might also affect the generalization. I am not very clear on why solving for the weights throughout the network means that they are interdependent. Is this somehow related to Batch Normalization (or layer normalization)? When composing different layers it might very well happen that there are multiple solutions to the set of simultaneous equations. This suggests that different LLRs might converge to different local minima.

2. While I am completely convinced that you should have high LLR for the first layers, I am not very sure why one would want lower LLR for the last layers. Yes, it is true that it is meaningless to optimize the last few layers if the first few layers are not optimized. But the gradient would still flow through the last layers, which can affect the gradient updates as well. Suppose you have a three-layer network. Initially, I train only the first layer (taking your version to the extreme: high LLR for the first, 0 LLR for the second and third). This gives us a decent enough network. Next we add the second layer, and repeat the process up to the third. What essentially happened is that we are providing a good initial point. If we had learnt the whole model at once, we would have started from an arbitrary initial point, convergence might have been difficult, and even if it had converged, the convergence might have been to a completely different point.

3. Flipping the LLR point stands. The intuition seems pretty correct.

4. Another thing that might work: have cyclical rates with a larger cycle length in the first few layers and cyclical rates with a smaller cycle length in the last few layers. But the initial learning rate would be the same.

I am just documenting my first thoughts, they could be way off.

Cheers

(Leslie N. Smith) #12

Good thoughts by @TheShadow29. These thought experiments are an essential part of knowing what experiments to run and provide the necessary feedback to confirm or change our thinking.

Changes to the LLRs might affect the generalization - I don’t know. If the experiments show that it does, I will need to think to gain a better understanding of why and my intuition will improve as a result.

The reason the values of the weights are interdependent is similar to solving a set of linear equations - the values obtained for the first equation depend on all the others (assuming a non-overcomplete set with a unique solution). In other words, the values of the first layer’s weights change the output from the layer, and the values of the weights for the next layer will need to adjust due to the different input to the layer. You seem to understand this in your later comments.

This is good. It helps one’s thinking to strip things to be as simple as possible (i.e., 3 layer network, high LLR for the first layer and 0 for the other two).

My reason for lower LLR for the last layers is to have a more stable convergence, which will allow the use of larger LLRs (and result in faster training). If the weights in the last layers change slowly, the gradients might be more consistent for the first layers. But I could also be way off. This thought experiment helps a great deal in improving my understanding too.

Interesting thought that each layer can have a different cycle length. However, I think the first few layers’ cycle lengths should be much shorter because they will converge much more quickly than in a typical training, and it is the last layers that need most of the training time (longer cycle). Experiments should prove this one way or the other.

I will clarify one more thing that I am thinking but I have not made explicit. My thinking is to have the up-down cycle only in the global learning rate. Actually, the LLR should be linear and not a cycle. That is, it is one value at the start of the training and another value at the end. The length of change can be different for each layer.

For example, one might train a 3 layer network for 100 epochs. One might set the LLR for the 1st layer to 2 initially and let it linearly drop to 0.5 in the first 10 epochs, keep the second layer’s LLR constant at 1.0, and the 3rd layer’s LLR could initially be 0.5 and linearly increase to 2.0 over the full 100 epochs. This is what I have in mind and you can see from this example that there are many potential variations possible.
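The example above can be written down as a small schedule table. This is only a sketch: it assumes the LLR holds its end value once its ramp finishes, which the post does not spell out, and the names are hypothetical.

```python
def linear_llr(epoch, start, end, ramp_epochs):
    """Move linearly from `start` to `end` over `ramp_epochs`, then hold `end`."""
    t = min(max(epoch / ramp_epochs, 0.0), 1.0)
    return start + (end - start) * t


# (start LLR, end LLR, epochs over which the change happens), one row per layer
SCHEDULE = [
    (2.0, 0.5, 10),    # layer 1: drops quickly during the first 10 epochs
    (1.0, 1.0, 100),   # layer 2: constant
    (0.5, 2.0, 100),   # layer 3: rises over the full 100 epochs
]


def llrs_at(epoch):
    """Layer learning rate multipliers for all three layers at a given epoch."""
    return [linear_llr(epoch, s, e, n) for s, e, n in SCHEDULE]


if __name__ == "__main__":
    for epoch in (0, 5, 10, 50, 100):
        print(epoch, llrs_at(epoch))
```

Each multiplier would then scale whatever the global learning rate policy returns at that epoch, so the table and the global policy can be varied independently.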

Does anyone else have thoughts about any of this they want to share?

Best,
Leslie

Some general thoughts I had on reading the Carbonnelle and De Vleeschouwer 2018 paper: their use of weight rotation rate was interesting. As for the rest of the paper, I’d have to do a fair bit more study to really understand their results.

Not exactly related to your line of thinking above, but it would be interesting to look at how regularization strategies could be combined with LLR for better accuracy - you mention in Smith and Topin 2018 that regularization should be tuned for each dataset and architecture. One thought is to use the rate of change of the gradients as a way to determine how to apply regularization - e.g. drop out those units that are not moving much, or those moving a lot, and see how this affects generalization. Or look at how to balance other forms of regularization vs regularization by high learning rates. Even though this may be a little off course for CLLR research, I thought I’d mention it anyway.

(Even Oldridge) #14

It’s not obvious to me why you’d want the peak learning rate to occur at the same point for all layer groups. Can you go into more detail about that intuition?

I also wanted to integrate one cycle, and one way to do that which I think would be effective here is a cascading series of cycles, with all layer groups converging at the final lowered learning rate at the end of the cycle. I think the spacing between peaks may provide opportunities for the earlier parts of the net to stabilize. The min and max of the cycle will likely not be the same for each group either, which is likely another important component.

(Leslie N. Smith) #15

As a thought experiment, what do you expect you will see if you used the rate of change of the gradients to guide dropout? What difference would it be to use dropout on the neurons with little or much gradient change?

I believe balancing other forms of regularization versus regularization by large learning rates is a general phenomenon. It is related to LLR because large LLRs could increase the regularization. Hence this too should be part of this study.

(Leslie N. Smith) #16

I am unsure what you are asking. Let me remind you that the global learning rate is entirely separate from the layer learning rates. The global LR could follow any policy, including 1cycle. Here I am talking about modifying the LLRs linearly during the training. I gave an example of the LLRs above.

Again I don’t understand. Perhaps you could give a specific example and explain your reasoning for why it would be effective.

(Even Oldridge) #17

You’re talking about a global learning rate, scaled linearly by LLRs, correct? I’d assumed based on your quote below that you would be scaling a one-cycle pattern, but I agree that it doesn’t have to be one cycle.

What I’m suggesting is that instead of scaling a global learning rate by some changing linear LLRs, that the LLRs would be independent and there wouldn’t be a global learning rate. By decoupling there’s more room for flexibility. One of the ways in which I envision that flexibility being useful is having earlier layers peak earlier in the training cycle. Imagine 5 layer groups, trained with 1-cycle, over 50 epochs. Group 1 could peak at epoch 10, group 2 at 15, group 3 at 20, etc.
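A small sketch of this staggered idea follows, with no global learning rate at all. It assumes a simplified triangular stand-in for 1-cycle (linear up to the peak, linear down to the minimum at the final epoch), the same min and max for every group, and peaks at epochs 25 and 30 for groups 4 and 5, which simply continue the pattern given above. All names and LR values are hypothetical.

```python
def triangular_lr(epoch, peak_epoch, total_epochs, min_lr, max_lr):
    """Rise linearly to `max_lr` at `peak_epoch`, then fall to `min_lr` at the end."""
    if epoch <= peak_epoch:
        frac = epoch / peak_epoch
    else:
        frac = (total_epochs - epoch) / (total_epochs - peak_epoch)
    return min_lr + (max_lr - min_lr) * frac


PEAKS = [10, 15, 20, 25, 30]   # peak epoch per layer group; last two are assumed
TOTAL_EPOCHS = 50


def group_lrs(epoch, min_lr=0.001, max_lr=0.01):
    """Independent learning rate for each of the 5 layer groups at `epoch`."""
    return [
        triangular_lr(epoch, peak, TOTAL_EPOCHS, min_lr, max_lr) for peak in PEAKS
    ]


if __name__ == "__main__":
    for epoch in (0, 10, 20, 30, 50):
        print(epoch, [round(lr, 4) for lr in group_lrs(epoch)])
```

All groups end at the same lowered rate at the final epoch, while earlier groups reach their peak sooner and so spend more of the run in the decaying phase.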

It’s not clear to me what advantage there is in linearly scaling a global learning rate schedule over giving each layer its own independent schedule. The design space is much smaller. Unless I’m somehow misunderstanding your description of linear LLRs and how they interact with the global LR.

My thoughts were that perhaps those weights with more trend on gradients may be adapting to less general features, those with a stable rate of gradient change either are not important or representing more general features. If we dropped out only from the population of either dud/slow learners vs the population of movers maybe this could beat purely random selection. Or maybe this would end up performing more poorly in general cases.
Another thought was to apply dropout to only a population with values for example one+ standard deviation from the mean after training for a few epochs- sort of a selective weight decay.
But I am clearly digressing off topic.

(Leslie N. Smith) #19

Now I understand - thanks for the explanation. To paraphrase, leave the global learning rate constant and let each layer’s LR contain the entirety of the change.

Your suggestion is more general than mine and worthy of investigation. In a research project, it is best to start as simply as possible and evolve into the more complex scenarios. I’d recommend starting with the first set of experiments as I described (constant global LR, linearly varying LLRs) and later running it as you described. It is a good suggestion.

(Leslie N. Smith) #20

I recommend you perform a literature search. I recall seeing a few papers that recommend non-random dropout. Let me know what you find.