I am reading the recent Leslie Smith paper instead of doing my homework (oops), and am confused by some of the talk about regularization.

It’s my understanding that “regularization” is what you do to keep the weights nicely bounded so that they don’t explode or vanish. Weight decay and learning a term to scale the weights by make total sense to me in this context.
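To make the weight-decay case concrete, here is a minimal sketch of a plain SGD step with L2 weight decay. The function name `sgd_step` and the values used are made up for illustration and are not taken from either paper. With the data gradient set to zero, the only force left is the decay term, so the weights shrink toward zero on every step.

```python
def sgd_step(w, grad, lr=0.1, weight_decay=0.0):
    """One plain SGD step with L2 weight decay (hypothetical helper)."""
    # Weight decay adds weight_decay * w to the gradient, pulling each weight toward 0.
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]


w = [10.0, -10.0]
zero_grad = [0.0, 0.0]  # pretend the loss gradient is zero, to isolate the decay

for _ in range(100):
    # Each step multiplies every weight by (1 - lr * weight_decay) = 0.95.
    w = sgd_step(w, zero_grad, lr=0.1, weight_decay=0.5)

print(w)  # both entries are now much closer to 0 than the starting +/-10
```

This is the "keeps the weights bounded" sense of regularization: the penalty acts on the weights directly, regardless of what the data says.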

However, the Smith paper (and others) talks about many other things as regularizers, including larger learning rates and smaller batch sizes.

Does anyone have some intuition as to why larger learning rates and smaller batch sizes should be regularizers?

(NB: another Smith paper shows that learning rate and batch size are inversely related, so I’m happy to believe that if a large learning rate is a regularizer, then so is a small batch size.)

Sure, I totally get how a larger learning rate will find you broader minima, but that’s generalization and not regularization. I don’t see how that helps you keep the weights nicely bounded.

Maybe there is a connection between generalization and regularization which I am missing?

Oh, I wasn’t aware that regularization strictly referred to keeping the weights in check. I kind of thought it was just a term to describe everything that helps prevent overfitting.

Err, see the paper: https://arxiv.org/abs/1711.00489
There is theory, which I think I would botch if I tried to explain, but there is also experimental evidence, which is what convinced me.
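For a rough feel of the learning rate/batch size trade-off, the linked paper describes an SGD "noise scale" that is approximately g ≈ εN/B when the batch size B is much smaller than the training set size N, with ε the learning rate. The snippet below is only a sketch of that approximation as I read it. The function name `sgd_noise_scale` and the numbers are hypothetical, and momentum is ignored.

```python
def sgd_noise_scale(lr, dataset_size, batch_size):
    """Approximate SGD noise scale g ~ lr * N / B, assuming B << N (hypothetical helper)."""
    return lr * dataset_size / batch_size


N = 50_000  # made-up training set size

base = sgd_noise_scale(lr=0.1, dataset_size=N, batch_size=128)
# Doubling both the learning rate and the batch size leaves the noise scale unchanged.
scaled = sgd_noise_scale(lr=0.2, dataset_size=N, batch_size=256)
# Raising the learning rate alone, or shrinking the batch alone, increases the noise.
higher_lr = sgd_noise_scale(lr=0.2, dataset_size=N, batch_size=128)
smaller_batch = sgd_noise_scale(lr=0.1, dataset_size=N, batch_size=64)

print(base, scaled, higher_lr, smaller_batch)
```

Under this approximation, a larger learning rate and a smaller batch size both push the same knob (more gradient noise), which is one way to see why they get lumped together as regularizers in the broad "helps prevent overfitting" sense.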