- Suppose we have 99 layers and `lr = np.array([1e-4, 1e-3, 1e-2])`. Then 1e-4 is assigned as the learning rate for the first 33 layers, 1e-3 for the next 33, and 1e-2 for the last 33.
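A minimal sketch of that layer-to-rate mapping (the even three-way split here is an illustration of the idea, not fastai's exact implementation):

```python
import numpy as np

lr = np.array([1e-4, 1e-3, 1e-2])
n_layers = 99

# Split the layer indices into as many groups as there are learning
# rates: 99 layers / 3 rates -> 33 layers per group.
groups = np.array_split(np.arange(n_layers), len(lr))

# Map each layer index to its group's learning rate.
layer_lrs = {int(i): float(lr[g]) for g, idxs in enumerate(groups) for i in idxs}

print(layer_lrs[0], layer_lrs[33], layer_lrs[98])  # 0.0001 0.001 0.01
```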
- It is important to understand that we use these differential learning rates in transfer learning. Here, we want the initial layers to train less, since these layers mostly represent low-level features such as edges and corners (and even if we let them learn more, they would eventually end up re-learning those same edges and corners), so we give them a lower learning rate. Instead of starting at a random point, we start at a better point. Wiki: Lesson 2 would help.
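A toy NumPy simulation of why this preserves the pretrained early layers (the weights and gradients here are made up for illustration): one SGD step with equal gradient magnitude moves the low-lr group 100x less than the high-lr group.

```python
import numpy as np

rng = np.random.default_rng(0)
lrs = [1e-4, 1e-3, 1e-2]                      # one rate per layer group
weights = [rng.normal(size=4) for _ in lrs]   # stand-in "pretrained" weights
before = [w.copy() for w in weights]

# One SGD step with a unit gradient everywhere: the low-lr (early)
# group barely moves, so its pretrained edge/corner features survive.
for w, lr in zip(weights, lrs):
    w -= lr * np.ones(4)

changes = [float(np.abs(w - b).max()) for w, b in zip(weights, before)]
print(changes)  # ≈ [0.0001, 0.001, 0.01]
```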