I am reading the recent Leslie Smith paper instead of doing my homework (oops), and am confused by some of the talk about regularization.

It’s my understanding that “regularization” is what you do to keep the weights nicely bounded so that they don’t explode or vanish. Weight decay and learning a term to scale the weights by make total sense to me in this context.
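
To make the weight-decay case concrete, here is a minimal sketch in plain Python of one SGD step with an L2 penalty. The helper name `sgd_step_with_weight_decay` and the values of `lr` and `wd` are my own illustrative assumptions, not something taken from the paper.

```python
def sgd_step_with_weight_decay(weights, grads, lr=0.1, wd=0.01):
    """One plain SGD step with L2 weight decay (hypothetical helper).

    Adding wd * w to the gradient is the same as minimizing
    loss + (wd / 2) * sum(w ** 2), which pulls every weight toward zero.
    """
    return [w - lr * (g + wd * w) for w, g in zip(weights, grads)]


# With a zero data gradient only the decay term acts, so each weight
# shrinks by a factor of (1 - lr * wd) on every step.
weights = [2.0, -3.0]
for _ in range(100):
    weights = sgd_step_with_weight_decay(weights, grads=[0.0, 0.0])
print(weights)
```

In this toy setup the weights can only move toward zero, which is the "keep the weights bounded" behaviour I have in mind.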

However, the Smith paper (and others) describe many other things as regularizers, including larger learning rates and smaller batch sizes.

Does anyone have some intuition as to why larger learning rate/smaller batch sizes should be regularizers?

(NB: another Smith paper shows that learning rate and batch size are inversely related, so I’m happy to believe that if a large learning rate is a regularizer, then so is a small batch size.)


As I understood it, larger learning rates, and in particular SGDR and cyclical LRs, allow you to find broader minima.
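
For anyone who has not seen SGDR, here is a rough sketch of a cosine schedule with warm restarts in plain Python. The function name `sgdr_lr`, the default rates, and the fixed cycle length are assumptions for illustration only; real implementations often lengthen each cycle.

```python
import math


def sgdr_lr(step, lr_max=0.1, lr_min=0.001, cycle_len=10):
    """Cosine-annealed learning rate with warm restarts (SGDR-style sketch).

    Within each cycle the rate decays from lr_max toward lr_min along a
    half cosine, then jumps back to lr_max at the start of the next cycle.
    Fixed-length cycles are a simplifying assumption.
    """
    t_cur = step % cycle_len  # position inside the current cycle
    cosine = 0.5 * (1.0 + math.cos(math.pi * t_cur / cycle_len))
    return lr_min + (lr_max - lr_min) * cosine


# Three cycles of ten steps each.
schedule = [sgdr_lr(s) for s in range(30)]
print([round(lr, 4) for lr in schedule])
```

The periodic jump back to a large rate is the part that is supposed to kick the weights out of narrow minima.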

Here is where Jeremy explains it, I think:

Sure, I totally get how a larger learning rate will find you broader minima, but that’s generalization and not regularization. I don’t see how that helps you keep the weights nicely bounded.

Maybe there is a connection between generalization and regularization which I am missing?

Oh, I wasn’t aware that regularization strictly referred to keeping the weights in check. I kind of thought it was just a term to describe everything that helps prevent overfitting.

I think “having a regularizing effect” is a very broad term that covers any technique that reduces overfitting.

How are learning rate and batch size inversely related?

Err, see the paper:
There is theory, which I think I would botch if I tried to explain it, but there is also experimental evidence which is what convinced me.
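
For a rough feel (this is not the paper's derivation, just a toy simulation under my own assumptions), you can measure how noisy a single SGD update is as you change the learning rate and the batch size. Here each per-example gradient is assumed to be drawn from a normal distribution with mean 1 and standard deviation 1, and `update_noise` is a hypothetical helper.

```python
import random
import statistics


def update_noise(lr, batch_size, trials=2000, seed=0):
    """Std. dev. of one SGD update when per-example gradients are noisy.

    Toy assumption: each per-example gradient is drawn from N(1, 1), and
    the minibatch gradient is the mean over batch_size draws.
    """
    rng = random.Random(seed)
    updates = []
    for _ in range(trials):
        batch_grad = sum(rng.gauss(1.0, 1.0) for _ in range(batch_size)) / batch_size
        updates.append(lr * batch_grad)
    return statistics.stdev(updates)


base = update_noise(lr=0.1, batch_size=64)
small_batch = update_noise(lr=0.1, batch_size=16)  # 4x smaller batch
large_lr = update_noise(lr=0.2, batch_size=64)     # 2x larger learning rate
print(base, small_batch, large_lr)
```

In this toy, doubling the learning rate and quartering the batch size both roughly double the per-step noise, so raising one or lowering the other pushes in the same direction.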

Thanks. I will look into it.