Cyclical Layer Learning Rates (a research question)


(adrian) #41

I thought the following were interesting; the first two are along the lines of my thinking.

The Hybrid Bootstrap: A Drop-in Replacement for Dropout
Kosar, Robert
Scott, David W.
The hybrid bootstrap is a regularization technique that functions similarly to dropout except that features are resampled from other training points rather than replaced with zeros.
http://arxiv.org/abs/1801.07316

Excitation Dropout: Encouraging Plasticity in Deep Neural Networks
Zunino, Andrea
Bargal, Sarah Adel
Morerio, Pietro
Zhang, Jianming
Sclaroff, Stan
Murino, Vittorio
In this work, we utilize the evidence at each neuron to determine the probability of dropout, rather than dropping out neurons uniformly at random as in standard dropout.
http://arxiv.org/abs/1805.09092

Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights
Mallya, Arun
Davis, Dillon
Lazebnik, Svetlana
By building upon ideas from network quantization and pruning, we learn binary masks that piggyback on an existing network, or are applied to unmodified weights of that network to provide good performance on a new task.
http://arxiv.org/abs/1801.06519

An interesting paper on tips for combining dropout with batchnorm:
https://arxiv.org/abs/1801.05134

I’ll keep digging for more non-random dropout/learned regularization strategies.


(Leslie N. Smith) #42

Regarding dropout, here is an older paper that occurred to me:

Ba, Jimmy, and Brendan Frey. “Adaptive dropout for training deep neural networks.” In Advances in Neural Information Processing Systems, pp. 3084-3092. 2013.


(Leslie N. Smith) #43

Getting back to the topic of layer-wise learning rates (LLR): it has been a few days since anyone commented, so I will take the opportunity to post my view of a set of experiments that should be informative:

First we need a baseline or two:
Baseline 1: Use a global LR, with both 1cycle and a typical LR policy (e.g., piecewise constant for half the epochs, then multiplying by 0.1, …)
Baseline 2: NVIDIA’s LARS or revised version called LARC.
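
For anyone setting up the baselines, here is a minimal sketch in plain Python. It is not taken from any library: the function names are hypothetical, the piecewise policy assumes a single drop at the halfway point, the 1cycle version is a simplified triangle without the final low-LR phase, and the LARC clip is modeled as min(local LR, global LR), which is an assumption about how the reference implementation behaves.

```python
import math


def piecewise_constant_lr(epoch, total_epochs, base_lr=0.1, factor=0.1):
    """Hold base_lr for the first half of training, then scale it by `factor`.

    Assumption: a single drop at the halfway point; later drops are omitted.
    """
    return base_lr if epoch < total_epochs / 2 else base_lr * factor


def one_cycle_lr(step, total_steps, min_lr, max_lr):
    """Simplified 1cycle: linear ramp min_lr -> max_lr -> min_lr.

    Assumption: the final low-LR phase of the full policy is left out.
    """
    half = total_steps / 2
    frac = step / half if step <= half else (total_steps - step) / half
    return min_lr + (max_lr - min_lr) * frac


def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))


def lars_layer_lr(weights, grads, global_lr, trust_coef=0.001,
                  weight_decay=0.0, clip=False):
    """LARS-style effective LR for one layer.

    local = trust_coef * ||w|| / (||g|| + weight_decay * ||w||)
    clip=False: effective LR is global_lr * local (LARS-style scaling).
    clip=True:  effective LR is min(local, global_lr) (assumed LARC-style clip).
    """
    w_norm = l2_norm(weights)
    denom = l2_norm(grads) + weight_decay * w_norm
    if w_norm == 0.0 or denom == 0.0:
        return global_lr  # degenerate layer: fall back to the global LR
    local = trust_coef * w_norm / denom
    return min(local, global_lr) if clip else global_lr * local
```

Calling `one_cycle_lr(step, total_steps, 0.1, 2.0)` once per iteration gives a global-LR baseline, and `lars_layer_lr` would be evaluated per layer on flattened weights and gradients.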

Next we should try a simple method to see if there is any merit in LLRs:
Manual method 1: Start training with a high LLR for the lowest layers and a low LLR for the highest layers, and linearly interpolate in between.
Manual method 2: Keep the global LR constant and apply cyclical LLR policies.
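
The two manual methods could be sketched roughly as follows. This is only one possible reading: the function names are hypothetical, method 1 covers just the initial per-layer assignment, and method 2 is assumed to mean a triangular per-layer multiplier on a fixed global LR, with layers optionally offset in phase (the `phase_shift` parameter is an illustration, not something specified above).

```python
def interpolated_llrs(num_layers, lowest_layer_lr, highest_layer_lr):
    """Method 1: linearly interpolate LRs from the lowest to the highest layer."""
    if num_layers == 1:
        return [lowest_layer_lr]
    step = (highest_layer_lr - lowest_layer_lr) / (num_layers - 1)
    return [lowest_layer_lr + i * step for i in range(num_layers)]


def triangular(step, cycle_len, low, high):
    """Triangular wave: `low` at the start of a cycle, `high` at its midpoint."""
    pos = (step % cycle_len) / cycle_len   # position within the cycle, in [0, 1)
    frac = 1.0 - abs(2.0 * pos - 1.0)      # 0 -> 1 -> 0 over one cycle
    return low + (high - low) * frac


def cyclical_llrs(step, num_layers, global_lr, cycle_len,
                  low_mult=0.1, high_mult=1.0, phase_shift=0):
    """Method 2: constant global LR, cyclical per-layer multipliers.

    Assumption: layer i is offset by i * phase_shift steps within the cycle;
    phase_shift=0 makes every layer follow the same cycle.
    """
    return [
        global_lr * triangular(step + i * phase_shift, cycle_len,
                               low_mult, high_mult)
        for i in range(num_layers)
    ]
```

For example, `interpolated_llrs(3, 0.02, 0.001)` assigns the largest LR to the first layer and the smallest to the last; the resulting list would then be passed to the optimizer as per-layer LRs.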

One automatic method to compare with LARS:
Automatic: LLR = movingAvg( change in norm of weights ) / movingAvg( change in norm of gradients )
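
A minimal sketch of this automatic rule for a single layer, under stated assumptions: "change in norm" is read here as the norm of the step-to-step difference (||w_t - w_{t-1}|| and ||g_t - g_{t-1}||), though the difference of norms is another possible reading; the moving average is an exponential one with a hypothetical `beta`; and `LayerLREstimator` is an illustrative helper, not an existing API.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))


class LayerLREstimator:
    """Tracks one layer and estimates its LR as avg(|dw|) / avg(|dg|)."""

    def __init__(self, beta=0.9, eps=1e-12):
        self.beta = beta          # EMA smoothing factor (assumed value)
        self.eps = eps            # guards against division by zero
        self.prev_weights = None
        self.prev_grads = None
        self.avg_dw = None
        self.avg_dg = None

    def _ema(self, avg, value):
        # The first observation initializes the average directly.
        if avg is None:
            return value
        return self.beta * avg + (1.0 - self.beta) * value

    def update(self, weights, grads):
        """Feed the layer's current flattened weights and gradients.

        Returns the LLR estimate, or None until two steps have been seen.
        """
        if self.prev_weights is not None:
            dw = l2_norm([w - p for w, p in zip(weights, self.prev_weights)])
            dg = l2_norm([g - p for g, p in zip(grads, self.prev_grads)])
            self.avg_dw = self._ema(self.avg_dw, dw)
            self.avg_dg = self._ema(self.avg_dg, dg)
        self.prev_weights = list(weights)
        self.prev_grads = list(grads)
        if self.avg_dw is None:
            return None
        return self.avg_dw / (self.avg_dg + self.eps)
```

One estimator per layer would be updated after every optimizer step, and its output compared against the LARS per-layer LR on the same run.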

Is anyone running any of these experiments?

Leslie

PS. I found a new paper by Dr. Hutter on finding optimal hyper-parameters:
BOHB: Robust and Efficient Hyperparameter Optimization at Scale
https://arxiv.org/abs/1807.01774
I am looking forward to reading it.


(nkiruka chuka-obah) #44

What would be a high learning rate and a low learning rate? Would 0.1 and 0.001 qualify?


(Leslie N. Smith) #45

It depends on the dataset and architecture. For a 3-layer network on CIFAR-10, I’ve used 0.02/0.001 for max_lr/min_lr. For ResNet-50 on ImageNet, I’ve used 2/0.1.