All,
I recently read, and was unimpressed by, the paper “On layer level control of DNN training and its impact on generalization” (arXiv:1806.01603). In spite of the authors’ claims, their results implied to me that it is best to use uniform, constant layer level learning rates, as is common practice.
The good news is that it reminded me that for years I’ve wanted to investigate (cyclical) layer level control of learning rates, but I haven’t had the time. If anyone here is also curious about this, we can look into it together.
The first step is a thought experiment; that is, to write down what we think we will find when we run the experiments, why we think it is so, and what experiments will show it. After we are clear about our expectations and why, we can run the experiments, which will likely cause us to revise our thinking. So, before running any experiments, reply to this post with your expectations and reasons.
I will start.
First, I expect that changes to the layer learning rates (LLR) will only affect the training speed, not the generalization or accuracy of the final network. I can think of one reason why uniform, constant LLR might be best: we are solving for the weights throughout the network (meaning they are interdependent), which is like solving a set of simultaneous equations. If so, one should solve for all of them together.
But I think it is more likely that one should start training with larger LLR in the first layers and smaller LLR in the last layers. Furthermore, near the end of training the layer learning rates should be reversed; that is, smaller LLR in the first layers and larger LLR in the last layers.
I can think of three reasons for my belief/intuition. First, changes in the first layer’s weights require changes in all the subsequent layers’ weights, so until the first layer’s weights are approximately correct, there’s little value in trying to get the subsequent layers’ weights correct. Hence, increase the LLR of the first layers. Second, a decade ago unsupervised pretraining was a common technique for training networks. The method was to set up the first layer as an autoencoder (AE) and train its weights to reconstruct the input. The next step was to add a second layer, fix the first layer’s weights, and train the second layer’s weights as an AE to reconstruct the output of the first layer. One recursively repeats this for every layer in the network. Clearly, this method trains the layers one at a time, from first to last; in my mind, dynamic LLR mirrors this idea. Finally, vanishing gradients are a known difficulty in training networks, and they most affect the training of the first layers. Hence, larger learning rates for the first layers in the beginning of training should help. Once those layers are approximately correct, one can lower those LLRs.
As for starting training with smaller LLR in the first layers and larger LLR in the last layers, I think this would slow down the training. Of course, experiments could show the reverse, in which case I’d have to understand why.
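To make the mechanism concrete, here is a rough PyTorch sketch of what I mean by per-layer learning rates. The architecture and the 1.5x/0.5x multipliers are placeholders for illustration, not a settled proposal.

```python
import torch
import torch.nn as nn

# Placeholder network; any architecture with a clear first-to-last layer order works.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 32 * 32, 10),
)

base_lr = 0.1
# One optimizer parameter group per layer that has weights, so each layer
# can be given its own learning rate (LLR).
layers = [m for m in model if any(p.requires_grad for p in m.parameters())]
n = len(layers)
# Larger LLR for the first layers, smaller for the last (here 1.5x down to 0.5x;
# these multipliers are guesses).
multipliers = [1.5 + (0.5 - 1.5) * i / (n - 1) for i in range(n)]
optimizer = torch.optim.SGD(
    [{"params": layer.parameters(), "lr": base_lr * m}
     for layer, m in zip(layers, multipliers)],
    lr=base_lr, momentum=0.9,
)
```

Reversing the LLRs near the end of training then amounts to updating each group’s lr in place as training progresses, which brings me to the experiments.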
As for experiments, there are several factors to consider, such as datasets, architectures, the LR policy, and how the LLR should vary from layer to layer and over the course of training. I prefer starting simply in order to find and fix problems, with the plan to eventually perform a comprehensive set of experiments. One possibility for simple experiments is a shallow network on CIFAR-10, later moving to deeper networks and larger datasets. I’d start with a piecewise constant global learning rate that drops by a factor of 0.1 at 50%, 75%, and 90% of the training; later I’d try CLR. I’d vary the layer learning rates linearly from the first to the last layer. Perhaps a first pass might set the first layer’s learning rate to 1.5 and the last layer’s to 0.5, but these are complete guesses on my part and will require experimentation. The LLR can change linearly over the course of the training, but some other schedule could be better, so this too must be tested.
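To be explicit about what I have in mind, here is a rough sketch of such a schedule. I’m treating the 1.5 and 0.5 values as multipliers on the global learning rate, and reversing them linearly over the run; both choices are assumptions to be tested, not conclusions.

```python
def global_lr(step, total_steps, base_lr=0.1):
    # Piecewise constant global LR: multiplied by 0.1 at 50%, 75%, and 90% of training.
    frac = step / total_steps
    drops = sum(frac >= m for m in (0.50, 0.75, 0.90))
    return base_lr * (0.1 ** drops)

def layer_multiplier(layer_idx, n_layers, step, total_steps,
                     start=(1.5, 0.5), end=(0.5, 1.5)):
    # Linear interpolation across layers: at step 0 the first layer gets start[0]
    # and the last layer gets start[1]; by the final step these are reversed
    # (the `end` pair). The change over time is also linear; other schedules
    # should be tested.
    t = step / total_steps
    first = start[0] + (end[0] - start[0]) * t
    last = start[1] + (end[1] - start[1]) * t
    x = layer_idx / max(n_layers - 1, 1)
    return first + (last - first) * x

def set_layer_lrs(optimizer, step, total_steps):
    # Assumes one parameter group per layer, ordered from first to last
    # (as in the sketch above). Call once per step before optimizer.step().
    g = global_lr(step, total_steps)
    n = len(optimizer.param_groups)
    for i, group in enumerate(optimizer.param_groups):
        group["lr"] = g * layer_multiplier(i, n, step, total_steps)
```

A convenient property of this setup is that with start equal to end it reduces to a fixed linear LLR profile, and with both set to (1.0, 1.0) it reduces to the usual uniform policy, which gives us the baselines for comparison.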
These are my first thoughts on this. What can YOU add to this?
Best regards,
Leslie