Learning rate and regularization

(Fan) #1

A very basic question: does a small learning rate have some regularization effect that prevents overfitting?

Here is the experiment I’m doing:
I’m using a MobileNet v2 backbone with a linear regression head that detects the nose position of a user.
On my local machine I’m using a very small set of data (1k images) to test things out, and I found something confusing:
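For context, the setup is roughly equivalent to this plain-PyTorch sketch (random tensors stand in for my images/keypoints, and the head sizes are simplified placeholders; the real runs use fastai’s fit_one_cycle, which corresponds to PyTorch’s OneCycleLR schedule):

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import OneCycleLR

torch.manual_seed(0)
n, bs, epochs, max_lr = 1000, 32, 5, 0.01

# Stand-ins for the frozen backbone's pooled features and (x, y) nose targets;
# 1280 is mobilenet_v2's feature dimension.
feats = torch.randn(n, 1280)
targets = torch.rand(n, 2)

# Head only -- the backbone is frozen, so only these parameters train.
head = nn.Sequential(
    nn.Linear(1280, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 2),
)
opt = torch.optim.Adam(head.parameters())
steps_per_epoch = (n + bs - 1) // bs
sched = OneCycleLR(opt, max_lr=max_lr, epochs=epochs,
                   steps_per_epoch=steps_per_epoch)

for epoch in range(epochs):
    perm = torch.randperm(n)
    for i in range(0, n, bs):
        idx = perm[i:i + bs]
        loss = nn.functional.mse_loss(head(feats[idx]), targets[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()  # one-cycle LR moves every batch, peaking at max_lr
```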

Four runs with different bs and max_lr, all using fit_one_cycle():
Run#1. bs=8, max_lr=0.1
epoch train_loss valid_loss time
0 0.076593 0.054923 00:22
1 0.046309 0.135450 00:22
2 0.046450 nan 00:22
3 0.042525 nan 00:22
4 0.037484 nan 00:22

Run#2. bs=32, max_lr=0.1
epoch train_loss valid_loss time
0 0.151003 0.083149 00:21
1 0.081594 0.068769 00:21
2 0.054937 0.112798 00:21
3 0.041963 0.111875 00:21
4 0.034696 0.331097 00:21

Run#3. bs=8, max_lr=0.01
epoch train_loss valid_loss time
0 0.262888 0.296105 00:22
1 0.070026 0.043623 00:22
2 0.035406 0.038374 00:22
3 0.029958 0.043296 00:22
4 0.025064 0.041888 00:22

Run#4. bs=32, max_lr=0.01
epoch train_loss valid_loss time
0 0.336268 0.307976 00:21
1 0.279342 0.423229 00:21
2 0.228122 0.302725 00:21
3 0.175511 0.173201 00:21
4 0.127010 0.088470 00:21
5 0.087338 0.054099 00:21
6 0.059434 0.054583 00:21
7 0.041613 0.049879 00:21
8 0.030702 0.051899 00:21
9 0.022852 0.049011 00:21
10 0.017813 0.050474 00:21
11 0.014728 0.045301 00:21
12 0.012539 0.047307 00:21
13 0.010974 0.046634 00:21
14 0.009960 0.045835 00:21
15 0.009033 0.045733 00:21
16 0.008392 0.047157 00:21
17 0.007924 0.049009 00:21
18 0.007636 0.047171 00:21
19 0.007190 0.048817 00:21

So I think I’m missing something very basic. Here is what I know:
a. If valid loss >> train loss, it means overfitting
b. If the train loss explodes, the LR is too high
c. A larger batch size is better

But why does a larger LR make the validation loss explode (compare Run#2 and Run#4)? Does a smaller LR also have some sort of regularization effect? Or is something else happening here?

Also, it seems like a smaller batch size makes training converge faster (because there are more iterations per epoch?). Is a bigger batch size ALWAYS better?
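For my 1k images, the number of optimizer updates per epoch works out like this:

```python
import math

n_images = 1000
for bs in (8, 32):
    steps = math.ceil(n_images / bs)
    print(f"bs={bs}: {steps} optimizer updates per epoch")
# bs=8: 125 optimizer updates per epoch
# bs=32: 32 optimizer updates per epoch
```

So bs=8 takes roughly 4x as many gradient steps per epoch as bs=32, even though the wall-clock time per epoch is about the same in my runs.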

Another interesting thing is that those results are from “head only” training, with all backbone parameters frozen at their ImageNet weights. I’ve tried multiple backbone networks (resnet34, resnet50, VGG, and different MobileNets) with the same regression head structure (a few sets of linear-bn-dropout layer groups). Only THIS MobileNet v2 (which I ported from TensorFlow using the MMdnn tools) has this issue… all the other experiments run really well (train and valid losses go down together). Even the “native PyTorch” MobileNet v2 from torchvision works pretty well…
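For reference, the head I attach to every backbone looks roughly like this (the hidden sizes and dropout probability here are placeholders, not my exact values):

```python
import torch
import torch.nn as nn

def make_head(in_features, hidden, n_out=2, p=0.5):
    """Regression head: a few Linear-BatchNorm-Dropout groups, then a Linear output."""
    layers = []
    sizes = [in_features] + list(hidden)
    for a, b in zip(sizes, sizes[1:]):
        layers += [nn.Linear(a, b), nn.BatchNorm1d(b), nn.ReLU(), nn.Dropout(p)]
    layers.append(nn.Linear(sizes[-1], n_out))  # (x, y) nose position
    return nn.Sequential(*layers)

# 1280 is mobilenet_v2's pooled feature dimension
head = make_head(1280, [512, 256])
out = head(torch.randn(8, 1280))
print(out.shape)  # torch.Size([8, 2])
```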

Thanks
Fan
