Questions Regarding Easy Steps to Train a World-Class Image Classifier

I’m currently a few lessons in and attempting a Kaggle competition (an image-classification challenge centered on furniture).

Referring back to Lesson 1 notes:

  1. Enable data augmentation, and precompute=True
  2. Use lr_find() to find highest learning rate where loss is still clearly improving
  3. Train last layer from precomputed activations for 1-2 epochs
  4. Train last layer with data augmentation (i.e. precompute=False) for 2-3 epochs with cycle_len=1
  5. Unfreeze all layers
  6. Set earlier layers to 3x-10x lower learning rate than next higher layer
  7. Use lr_find() again
  8. Train full network with cycle_mult=2 until over-fitting
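For reference, here is roughly how I’m implementing steps 1-4 with the fastai 0.7 course library. This is a minimal sketch; PATH, sz, and the chosen lr are placeholders, not values from my actual run:

from fastai.conv_learner import *

PATH = 'data/furniture/'  # placeholder path to the competition data
arch = resnext101_64
sz = 224

# Step 1: enable data augmentation, and precompute=True
tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(arch, data, precompute=True)

# Step 2: find the highest learning rate where loss is still clearly improving
learn.lr_find()
learn.sched.plot()
lr = 1e-2  # placeholder value read off the plot

# Step 3: train last layer from precomputed activations for 1-2 epochs
learn.fit(lr, 2)

# Step 4: train last layer with augmentation (precompute=False), cycle_len=1
learn.precompute = False
learn.fit(lr, 3, cycle_len=1)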

Is it a hard and fast rule in step (3) that I only train from precomputed activations for 1-2 epochs? For the Kaggle competition above, I’m using resnext101_64 and was able to train for 45 epochs before I began to overfit (the last 15 epochs are shown below):

learn.fit(lr, 4, cycle_len=1)

epoch  trn_loss  val_loss  accuracy
30     0.614433  0.571995  0.831224
31     0.601567  0.568777  0.831039
32     0.590945  0.569204  0.830854
33     0.607919  0.569634  0.831672
34     0.597689  0.569224  0.829826
35     0.605900  0.567692  0.830960
36     0.566043  0.566638  0.832437
37     0.583778  0.563373  0.831566
38     0.590068  0.564728  0.831382
39     0.583658  0.564554  0.832015
40     0.598059  0.564066  0.832991
41     0.548937  0.564171  0.832964
42     0.582655  0.564410  0.833201
43     0.571591  0.565146  0.832806
44     0.550825  0.564419  0.832991

I’m thinking the reason I’m able to train for so many epochs before I start to overfit is that (a) I have 180k training images, and (b) resnext101_64 was not trained on furniture images (which is what the competition is based around).

Are there implications down the line once I set precompute=False and call learn.unfreeze()? Will I have, in some sense, ‘overfit’ my final layer and made it dependent on initial and middle layers that were tuned to detect cats / dogs?

My concern is that once I set precompute=False and call learn.unfreeze(), the initial and middle layers will start learning to detect furniture (instead of staying tuned to cats & dogs), and the final layer will then need to adjust dramatically (e.g., because it became over-dependent on the cat & dog convolutions in those earlier layers).
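For concreteness, the continuation I have in mind for steps 5-8 looks roughly like this (continuing the sketch above; the differential learning-rate values are placeholders following the 3x-10x guideline):

# Step 5: unfreeze all layers
learn.unfreeze()

# Step 6: 3x-10x lower learning rates for earlier layer groups
lrs = np.array([lr / 9, lr / 3, lr])

# Step 7: re-run lr_find on the unfrozen network
learn.lr_find(lrs / 1000)
learn.sched.plot()

# Step 8: train the full network with cycle_mult=2 until over-fitting
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)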
