Transfer Learning in fast.ai - How does the magic work?

Not really, that I know of. There isn't a huge amount in the lessons about optimisers, and most of what I remember was fairly introductory, explaining Adam itself rather than particular tweaks. The latest part 2, which builds the library up from scratch, had a section on optimisers, but that was some new stuff around a LAMB-based optimiser rather than the Adam-based one in v1, and then it went into the new scheduler ideas (as in one-cycle/cosine-annealing scheduling of parameters).
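For reference, the scheduling side surfaces in the v1 API roughly like this (just an illustrative sketch with made-up epoch/lr values; learn is whatever Learner you already have):

# One-cycle: lr ramps up then anneals down over the cycle, while momentum
# is annealed in the opposite direction (0.95 -> 0.85 -> 0.95 by default).
learn.fit_one_cycle(5, max_lr=1e-3, moms=(0.95, 0.85))

# Which is just fit() with the OneCycleScheduler callback attached for you:
from fastai.callbacks import OneCycleScheduler
learn.fit(5, lr=1e-3, callbacks=[OneCycleScheduler(learn, lr_max=1e-3)])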

Oh, oops, if you just did create_cnn_model(models.resnet50, data_c, pretrained=True) then that will result in an unfrozen model. I forgot to add the freezing and the code to create the layer groups, so that would be a deviation in those last few experiments. If you didn't pull those changes you'd need:

from fastai.vision.learner import cnn_config  # not in __all__, so it needs an explicit import
meta = cnn_config(base_arch)  # base_arch is the function used to create the model, e.g. models.resnet50
learn.split(meta['split'])    # create the layer groups for this architecture
learn.freeze()                # freeze everything except the final layer group (the head)

You can use learn.layer_groups to see the groups.
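If it helps, here's the whole thing end-to-end as I understand it (a sketch only, assuming data_c is a DataBunch; cnn_learner does all of this for you in one call):

from fastai.vision import *                                      # models, Learner, etc.
from fastai.vision.learner import create_cnn_model, cnn_config   # cnn_config isn't in __all__

base_arch = models.resnet50
meta = cnn_config(base_arch)              # dict holding the 'cut' and 'split' for this architecture
model = create_cnn_model(base_arch, data_c.c, cut=meta['cut'], pretrained=True)

learn = Learner(data_c, model)
learn.split(meta['split'])                # build the layer groups
learn.freeze()                            # only the head is trainable to start with
print(len(learn.layer_groups))            # should be 3 for a resnet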

Looking around, there doesn't actually seem to be that much related to optimisers apart from the layer groups. The main thing I found was that in torch_core you have AdamW = partial(optim.Adam, betas=(0.9,0.99)), which is the default for Learner.opt_func. Then fastai.callback.OptimWrapper wraps the optimiser and manages the hyper-parameters. The best summary I could find was:

    @classmethod
    def load_with_state_and_layer_group(cls, state:dict, layer_groups:Collection[nn.Module]):
        res = cls.create(state['opt_func'], state['lr'], layer_groups, wd=state['wd'], true_wd=state['true_wd'],
                         bn_wd=state['bn_wd'])
        res._mom,res._beta = state['mom'],state['beta']
        res.load_state_dict(state['opt_state'])
        return res

So those look like the params it's playing with. I think opt_state is the current optimiser state rather than hyper-parameters (it's a bunch of tensors), so the others would be the key ones. You've already looked at lr and wd, so true_wd and bn_wd would be the ones to check if you haven't and are still digging.
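They're all just arguments to Learner, so you can set them when you create it; something like this (purely illustrative values, not a recommendation):

from functools import partial
import torch.optim as optim
from fastai.vision import *

AdamW = partial(optim.Adam, betas=(0.9, 0.99))    # same as the torch_core default opt_func

learn = cnn_learner(data_c, models.resnet50, opt_func=AdamW,
                    wd=1e-2,         # weight decay
                    true_wd=True,    # apply wd directly to the weights (decoupled, AdamW-style)
                    bn_wd=False)     # skip weight decay on the batchnorm layers
learn.fit_one_cycle(3, max_lr=slice(1e-5, 1e-3))  # discriminative lrs across the layer groups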

Yeah, though it looks like you're doing shortish runs (for obvious reasons), so it might be a speed thing and PyTorch/TF would catch up in the end. I think the general idea in fastai is to use fairly aggressive settings and lean on extensive regularisation to mitigate the problems that causes. While I'm not very experienced in DL (and have only really used fastai apart from a little playing around), I very rarely see runs go off the rails, so I guess it generally works (though I also probably haven't used any of the trickier-to-train models).
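By aggressive settings plus regularisation I mean things along these lines (a rough sketch, the values are just the usual defaults rather than anything special):

from fastai.vision import *

# Fairly high lr from lr_find, with dropout, weight decay and mixup doing the regularising.
learn = cnn_learner(data_c, models.resnet34, ps=0.5, wd=1e-2).mixup()
learn.fit_one_cycle(5, max_lr=1e-3)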