I am reading the recent Leslie Smith paper instead of doing my homework (oops), and am confused by some of the talk about regularization.
It’s my understanding that “regularization” is what you do to keep the weights nicely bounded so that they don’t explode or vanish. Weight decay, and learning a term to scale the weights by, make total sense to me in this context.
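To make that first kind concrete, here's a toy sketch (my own illustration, not from any of the papers) of weight decay folded into a plain SGD step: the `weight_decay * w` term shrinks the weights a little every update, which is what keeps them bounded.

```python
import numpy as np

def sgd_step(w, grad, lr=0.1, weight_decay=1e-2):
    # Weight decay adds lambda * w to the gradient, so each step
    # pulls the weights toward zero in addition to following the loss.
    return w - lr * (grad + weight_decay * w)

w = np.ones(3)
for _ in range(100):
    # Zero gradient, so only the decay term acts: the weights
    # just shrink geometrically by (1 - lr * weight_decay) per step.
    w = sgd_step(w, grad=np.zeros(3))
print(w)
```

With a real loss gradient in place of the zeros, the decay term competes with the gradient, so the weights settle wherever the two balance rather than growing without bound.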
However, the Smith paper (and others) describes many other things as regularizers, including larger learning rates and smaller batch sizes.
Does anyone have some intuition as to why a larger learning rate or a smaller batch size should act as a regularizer?
(NB: another Smith paper shows that learning rate and batch size are inversely related, so I’m happy to believe that if a large learning rate is a regularizer, then so is a small batch size.)
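For what it’s worth, the relation I have in mind is (from memory, so treat this as my paraphrase rather than a quote from the paper) that the scale of the noise SGD injects into the updates grows with the ratio of learning rate to batch size, roughly

$$ g \approx \epsilon \frac{N}{B} $$

where $\epsilon$ is the learning rate, $N$ the training set size, and $B$ the batch size. If that’s right, then holding $g$ fixed means $\epsilon \propto B$, so raising the learning rate at fixed batch size and shrinking the batch size at fixed learning rate both increase the same noise term — which is why I’d expect them to regularize (or not) together.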