Learning Rate Dropout

Just finished reading this paper. Quick summary: each weight parameter gets its own learning rate, and at each training iteration a random fraction (p) of those learning rates is kept while the rest are “turned off” (set to zero). The gradients themselves are still computed and accumulated (so momentum keeps working), but a parameter whose learning rate is zero does not update that step. According to the paper, this “is equivalent to uniformly sampling one of 2^n possible parameter subsets for which to perform a gradient update.” The effect is to preserve the overall gradient descent while injecting noise into the direction of descent, which aids convergence and helps escape saddle points and poor local minima.
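To make the mechanism concrete, here is a minimal NumPy sketch of one SGD-with-momentum step with learning rate dropout, as I understand it from the summary above. The function name, signature, and hyperparameter defaults are my own assumptions, not taken from the paper or the linked repo; the key point is that the momentum buffer sees the full gradient while the parameter update is masked per-parameter.

```python
import numpy as np

def lr_dropout_step(w, grad, velocity, lr=0.1, momentum=0.9, p=0.5, rng=None):
    """One SGD-with-momentum step with learning rate dropout (a sketch).

    Each parameter's learning rate is kept with probability p and set to
    zero otherwise. The momentum buffer still accumulates the full
    gradient, so masked parameters keep "remembering" past gradients.
    """
    rng = np.random.default_rng() if rng is None else rng
    velocity = momentum * velocity + grad   # momentum sees every gradient
    mask = rng.random(w.shape) < p          # per-parameter keep mask (Bernoulli p)
    w = w - lr * mask * velocity            # masked parameters do not move
    return w, velocity
```

With p=1 this reduces to ordinary momentum SGD; with p=0 no parameter moves, yet the velocity still updates, which is the part that lets momentum keep working across dropped steps.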

The intuitions presented seem interesting. I haven’t attempted to verify their claims. Someone is working on an implementation here: https://github.com/noahgolmant/pytorch-lr-dropout.