Transfer Learning in fast.ai - How does the magic work?

Thanks a lot for checking this out. I played around with these parameters (wd, true_wd, bn_wd) without any notable effect.

Yeah, though it looks like you are doing shortish runs (for obvious reasons), it might be a speed thing and PyTorch/TF will catch up in the end.

This didn’t seem to be the reason - in PyTorch/TF I also trained much longer and experimented with different learning rates, but I never got accuracies higher than 80%.


But now I finally found out where the magic is happening:

I thought that when you create a learner using the cnn_learner() function, the complete pretrained CNN model would be frozen by default, and only the appended head would be trainable.
This is NOT the case! The BatchNorm layers in the pretrained model are trainable by default! In my TF/PyTorch experiments, I trained the models with frozen BatchNorm layers. This was the reason for the different performance.

You can configure this using the train_bn parameter when creating a new Learner:

learn = cnn_learner(data, models.resnet50, metrics=[accuracy], train_bn=False)

… This gives me only 71% accuracy after 3 epochs (compared to 89% when train_bn=True).
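To double-check this, you can look at which parameters still require gradients after freezing. A minimal sketch (fastai v1-style names assumed; trainable_bn_params is just a hypothetical helper for illustration):

import torch.nn as nn

def trainable_bn_params(model):
    # Count trainable parameters that sit inside BatchNorm2d layers,
    # i.e. the BatchNorm layers of the convolutional body.
    n = 0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            n += sum(p.numel() for p in m.parameters() if p.requires_grad)
    return n

learn.freeze()
print(trainable_bn_params(learn.model))  # non-zero with train_bn=True, 0 with train_bn=False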

There are actually threads on that.


Apparently Jeremy also mentions it in part 2 of this year's course.

It seems to be absolutely crucial to not freeze the BatchNorm layers when doing CNN transfer learning!
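For anyone trying to reproduce this outside fastai: in plain PyTorch the same idea looks roughly like the sketch below (just an illustration, not fastai's actual code). The pretrained backbone is frozen except for its BatchNorm layers, and a new head would then be trained on top.

import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50(pretrained=True)

for m in backbone.modules():
    if isinstance(m, nn.BatchNorm2d):
        # Keep the BatchNorm affine parameters trainable. Note that the
        # running statistics also keep updating as long as the module
        # stays in train() mode.
        for p in m.parameters():
            p.requires_grad = True
    else:
        # Freeze the direct parameters of every other layer (conv weights etc.).
        for p in m.parameters(recurse=False):
            p.requires_grad = False

The original fc layer of the backbone would then be replaced by a new, trainable classification head.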

One thing I still don’t understand is why learner.fit(3) and learner.fit(3, lr=0.003) gave me different results, as 0.003 is clearly the default value for the learning rate… but that’s for another day.

@TomB Thank you very much for your help.

No problem, great to see you got to the bottom of it; I was interested to know.

The default is actually slice(None, 0.003, None); this is used to create a range of values to use as LRs for the different layer groups. So it will start at 0 (the default for a range) for the first layer group and go up to 0.003 for the last. Ideally this avoids big jumps in earlier layers, which can hurt performance, and favors learning in later layers, notably the less fragile linear head. So I gather that even in a frozen model this would slow down BatchNorm learning in earlier groups, making them slower to update to the new distribution of inputs and resulting activations in transfer learning.
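To make the layer-group idea concrete, here is a simplified sketch of how a single maximum LR could be spread over the groups (illustrative only, not fastai's exact lr_range rule; spread_lr is a made-up helper):

def spread_lr(lr_max, n_groups, lr_min=0.0):
    # One learning rate per layer group, ramping linearly from lr_min
    # up to lr_max for the last group (the head).
    step = (lr_max - lr_min) / max(n_groups - 1, 1)
    return [lr_min + i * step for i in range(n_groups)]

print(spread_lr(0.003, 3))  # [0.0, 0.0015, 0.003]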

That makes sense, thanks for clarifying. I just noted that when I train a model using a smaller, constant learning rate (e.g. learner.fit(3, lr=0.0008)), I get better results. So apparently 0.003 for all layers in the head was too big.

Hi, I’m trying your code and somehow it throws an error when applying GlobalAveragePooling2D to “base_model.output”. It’s a shape error. Didn’t you have this problem?

Hi teoddor,
No, I’ve never encountered this error during my experiments. Which version of TensorFlow are you using? If you shared your code it might be easier to reproduce the issue.
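In case it helps: the usual Keras pattern for that step looks roughly like the sketch below (names and class count are just placeholders). The shape error you describe typically appears when the base model was created with include_top=True, because its output is then already 2-D and GlobalAveragePooling2D expects a 4-D feature map.

import tensorflow as tf

base_model = tf.keras.applications.ResNet50(weights="imagenet",
                                            include_top=False,   # keep the 4-D feature-map output
                                            input_shape=(224, 224, 3))
x = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)  # expects (batch, h, w, channels)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)     # 10 classes as a placeholder
model = tf.keras.Model(base_model.input, outputs)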