Transfer Learning in fast.ai - How does the magic work?

That size parameter is not batch size as I think you intend; rather, it is image size, so you are actually resizing to 64 (the longest dimension, fastai takes single-dimension sizes). I'd have to dig to confirm, but I think this will make it upsize with the default transforms rather than just not crop. Batch size (the bs parameter) is being left as the default, which is also 64, hence this was hard to spot.
As you seem pretty capable in TF, I would note that ImageDataBunch.from_* is mainly intended for very new learners. Using the separate methods is generally the best way to go, as it keeps the steps separate and helps avoid issues like this (given the myriad different things you're trying to provide parameters for in a single function). Looking at the source for ImageDataBunch.from_folder you can see it's doing:

src = (ImageList.from_folder(...)
                .split_by_folder(...)
                .label_from_folder(...))

The next bit isn’t so clear from the source, but it’s like:

data = (src.transform(get_transforms(), size=...)
           .databunch(bs=...)
           .normalize(imagenet_stats))

(showing where that size was going and where batch size would go).

There's also an init parameter to cnn_learner that takes an initialisation function.
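For example (a rough sketch, assuming the usual from fastai.vision import * and an existing data DataBunch):

import torch.nn as nn

# cnn_learner exposes an `init` argument that is applied to the newly created head;
# kaiming_normal_ is also the default in fastai v1.
learn = cnn_learner(data, models.resnet50, metrics=[accuracy],
                    init=nn.init.kaiming_normal_)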

On performance, if you haven’t already you might want to have a poke through fastai.layers, some of the magic is in well-defined units in there.
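For example, conv_layer is one of those units (a hedged sketch):

from fastai.layers import conv_layer

# conv_layer bundles a Conv2d plus ReLU and BatchNorm2d into a single nn.Sequential.
block = conv_layer(3, 64, ks=3, stride=1)
print(block)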

1 Like

@TomB Thanks a lot for pointing this out.
Meanwhile I did an implementation in PyTorch and got about the same results as in TensorFlow, so I concluded the cause is neither the weight initialization nor the different pretrained models. Something in fast.ai has to be going on.

After doing some experiments with the data pipeline and the size parameter you mentioned, I finally found something interesting:

When defining the pipeline as follows (without resizing or transformations), I finally get bad results:

data = (ImageList.from_folder(data_path)
            .split_by_folder(train='train', valid='test')
            .label_from_folder()
            .databunch(bs=64)
            .normalize(imagenet_stats)
      )

But now it gets really interesting. If I just change the size parameter, I get good results (89% accuracy):

data = (ImageList.from_folder(data_path)
            .split_by_folder(train='train', valid='test')
            .label_from_folder()
            .transform(None, size=64)
            .databunch(bs=64)
            .normalize(imagenet_stats)
      )

That doesn't make any sense to me at all. I first thought that setting the size parameter would trigger the default transformations (as you suspected). However, this doesn't seem to be the case: when I add the transformations as follows, without the size parameter, I still get bad results:
.transform(get_transforms())

Also very strange: if I train my TensorFlow or PyTorch models with 64x64 image shapes, I get much worse results.
So then I suspected that maybe there was a bug with the size parameter, and that in the end the model is trained with a larger size. However, the output shapes of the CNN do correspond to an input size of 64x64, so that doesn’t seem to be the cause either.
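One way to verify the shapes (a sketch, not my exact code, assuming the data pipeline above):

# learn.summary() prints each layer's output shape, so an effective 64x64 input
# can be confirmed end to end.
learn = cnn_learner(data, models.resnet50, metrics=[accuracy])
print(learn.summary())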

The only explanation I can now imagine for this behaviour is that, if you set the size parameter, fast.ai pulls a different pretrained model (one trained on smaller ImageNet images than the default 224x224) and thus works well with 64x64 images.

Does somebody have an idea what is going on?

The same pretrained model (ImageNet weights) is used unless you pass a custom pretrained model. Consider this: is there anything we can really gather from a 64x64 image? I'd say not really unless it's an MNIST-type situation; it'll just look like a random cluster of pixels. You're better off making this comparison using a size more commonly used or "standard" within the library: a multiple of 8 and above 200, usually between 224 and 360. One thing to check when the original databunch is being made is the size of the tensors that are generated when you call data (it'll show something like 3x…x…, where the numbers after the 3 are the size). Does this clarify some things for you, @nkaenzig?
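For example (a rough sketch):

# Grab one batch from the DataBunch and look at its shape; the tensor should be
# [batch_size, 3, size, size], e.g. torch.Size([64, 3, 64, 64]) with size=64 and bs=64.
x, y = data.one_batch()
print(x.shape)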

Ah here’s another bit you may have missed. When we call cnn_learner our weights are frozen (from the pretrained model) except for the very last layer, so we train only it (and that’s why we call learn.unfreeze() at the end to use the entire thing). Are you accounting for this in Keras or unfreezing after you generated your Learner?
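In code terms, roughly (a sketch of the usual flow, not your exact setup):

learn = cnn_learner(data, models.resnet50, metrics=[accuracy])  # body frozen by default
learn.fit_one_cycle(3)                             # trains only the new head
learn.unfreeze()                                   # make the whole network trainable
learn.fit_one_cycle(3, max_lr=slice(1e-5, 1e-3))   # discriminative LRs across layer groups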

@muellerzr Thanks for your response. I am well aware of these two aspects that you pointed out. In all the experiments I conducted, the pretrained layers were frozen and only the newly added output layers were trained.

It also doesn't surprise me at all that the TensorFlow and PyTorch models don't work well when using 64x64 images for training, when they were pretrained with 224x224 images (for the reasons you mentioned).
What does surprise me a lot, is that training with 64x64 images somehow works very well in fast.ai. This is what doesn’t make any sense to me. Something very strange seems to be happening here under the hood.

2 Likes

Thanks for checking for me :slight_smile: One little difference I can see right now: fastai will by default create a head with a dropout of 0 instead of the 0.25 you specify for your Dropout in Keras (learn.summary showed zero as well): https://github.com/fastai/fastai/blob/master/fastai/layers.py#L44
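If you want the head to match your Keras dropout, it's controlled by the ps argument (a sketch; 0.25 here is just the value from your Keras model):

# ps sets the dropout used when fastai builds the head.
learn = cnn_learner(data, models.resnet50, metrics=[accuracy], ps=0.25)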

No, you need to call get_transforms for the default transforms to be applied.
Training at the moment so I can't run much, but poking around in fastai.vision.image.Image.apply_tfms (which is where size ends up), there's code that, if size is set, adds a TfmCrop, and it looks like it uses ResizeMethod.CROP, though I'm less sure on that. So you should end up with that as your only transform. You should find a data.train_ds.x.tfms with the final transforms to verify (or maybe data.train_ds.tfms).
I thought perhaps it was randomising, as various transforms do, effectively giving some data augmentation. But it doesn't look like it, though that's a bit hard to tell from static analysis. So it is actually sizing your images up to 64x64? I couldn't quite tell how it was doing that (resizing, or just repeating at the edges, or something). It still seems odd that that would have a large effect.
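Something like this should show what the training set actually ended up with (a sketch, mirroring the attribute names above):

# Depending on the version the transform list lives on the LabelList or on its x.
print(getattr(data.train_ds, 'tfms', None))
print(getattr(data.train_ds.x, 'tfms', None))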

One thought was that ResNet has a stride-2 conv as its first layer, so you instantly lose half the spatial size, with only that first conv able to extract information. The resize would preserve more pixels for subsequent processing. I've suspected this early reduction may not be ideal in some cases, but I haven't done any real testing of it (I was looking at segmentation, where you'd thus lose detail; I did find this to perhaps be true, but I was varying architectures and implementations, so it's mainly supposition).
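For concreteness (a sketch using torchvision, which should match the fastai body):

import torchvision

m = torchvision.models.resnet50(pretrained=False)
print(m.conv1)    # Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
print(m.maxpool)  # MaxPool2d(kernel_size=3, stride=2, padding=1)
# So a 64x64 input is already down to 16x16 after the first conv and max pool.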

Oh, and that still doesn't explain the difference with TF at 64px, unless perhaps fastai is using better resizing. Seems a stretch, but maybe. Running out of differences.

1 Like

@muellerzr Just retrained my PyTorch model with 0 dropout, didn’t change the results at all, which I expected.

@TomB Thanks a lot for checking this out. The different resizing operation is indeed the only difference I’m seeing at the moment. Later I’m going to do some experiments using the exact same resize operation in TF/PyTorch to see what happens. But I doubt that this is the cause…

Yeah, seems unlikely to be the resize, but can’t see much else.
Did you rebuild the fastai model for PyTorch, or just create it with fastai and train with PyTorch? I think the latter should work fine and would definitely eliminate the model as a cause. The other thing along those lines would be to use the rebuilt PyTorch model in fastai, which should also work.

@TomB I did “rebuild” the model in my PyTorch experiment (in that notebook I didn’t even import the fast.ai library).

I just tried to use the fast.ai model in PyTorch, but I’m getting an error:

AttributeError: ‘ImageFolder’ object has no attribute ‘c’

Here’s the code:

import os
import torch
import fastai
import fastai.vision
from torchvision import datasets, transforms
from fastai.vision import imagenet_stats

cifar10_dir = 'data/cifar10/'
IMG_DIM = 64       # image size used in these experiments
BATCH_SIZE = 64

data_transforms = {
    'train': transforms.Compose([
        transforms.Resize(IMG_DIM),
        transforms.ToTensor(),
        transforms.Normalize(imagenet_stats[0], imagenet_stats[1])
    ]),
    'test': transforms.Compose([
        transforms.Resize(IMG_DIM),
        transforms.ToTensor(),
        transforms.Normalize(imagenet_stats[0], imagenet_stats[1])
    ]),
}

image_datasets = {x: datasets.ImageFolder(os.path.join(cifar10_dir, x), data_transforms[x]) for x in ['train', 'test']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=BATCH_SIZE, shuffle=True, num_workers=4) for x in ['train', 'test']}

databunch = fastai.vision.DataBunch(dataloaders['train'], dataloaders['test'])
learn = fastai.vision.cnn_learner(databunch, fastai.vision.models.resnet50, metrics=[fastai.vision.accuracy], true_wd=False)
learn.fit(3)

The error is thrown when calling cnn_learner(); not sure what to do to prevent this, as the DataBunch itself is instantiated without error.

Oh yes, sorry cnn_learner won’t work as it calls data.c to get the number of categories. You need create_cnn_model, something like:

from fastai.vision import models
from fastai.vision.learner import create_cnn_model

data_c = <#categories>
model = create_cnn_model(models.resnet50, data_c, pretrained=True)

should give the same model.

I think though that trying to use the fastai learner with a PyTorch dataset will be tricky. So you might be better training with the PyTorch stuff.

You can also try a fastai learner with a PyTorch model, something like:

learn = Learner(pytorch_model, fastai_data)

should work fine; nothing special on the model end in fastai except for the creation stuff (it's just settings on standard PyTorch layers).

1 Like

@TomB You're right, running a PyTorch model with a fast.ai Learner was much easier. I just did that, and I got exactly the same results as when using the fast.ai model (i.e. the resnet50 model that fast.ai uses seems to be the same one torchvision uses).

Here's the code I used:

import torchvision
learn = cnn_learner(data, torchvision.models.resnet50, metrics=[accuracy], true_wd=False)
learn.fit(3)

So the magic definitely happens somewhere in the data pipeline… Very strange…

Either the data or the optimiser; fastai wraps the PyTorch optimisers with various stuff even when not doing one-cycle. I haven't looked into this much, so I'm not sure what tricks are in there. Though given that your training without resizing (barring anything missed) gave the same results as PyTorch, it seems like the data.
I would note that the default learning rate for fit uses discriminative learning rates, so lower rates on earlier layer groups. The default value is lr=slice(None, 0.003, None), which uses 0.003 for the final layer group and, I believe, lr/10 for the first. Passing a single value (i.e. just 0.003) disables this. However, as the model is frozen there should only be one layer group active, so this shouldn't come into it.
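To make the difference concrete (a sketch):

learn.fit(3)                          # default lr=slice(None, 0.003, None): discriminative LRs
learn.fit(3, lr=0.003)                # a single float: the same LR for every layer group
learn.fit(3, lr=slice(1e-4, 3e-3))    # explicit range from first to last layer group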

I think PyTorch's transforms are just callable classes, so you should be able to pass a fastai transform in. Otherwise you would need to subclass to call TfmCrop, to check that.

2 Likes

After doing some experiments, I found that the optimizer does indeed seem to have an impact.


I just ran the following experiments:

OPTIMIZER CONFIGS:

A)
learner.fit(3)

B)
learner.fit(3, lr=0.003)

DATA CONFIGS:

X)

data = (ImageList.from_folder(data_path)
            .split_by_folder(train='train', valid='test')
            .label_from_folder()
            .transform(None, size=64)
            .databunch(bs=64)
            .normalize(imagenet_stats)
      )

Y)

data = (ImageList.from_folder(data_path)
            .split_by_folder(train='train', valid='test')
            .label_from_folder()
            .databunch(bs=64)
            .normalize(imagenet_stats)
      )

RESULTS (accuracy)

  • A-X: 89%
  • B-X: 73%
  • A-Y: 78%
  • B-Y: 68%

So specifying a constant learning rate (Config B) does seem to have an impact, even though the base model is frozen - which is kind of strange.

In PyTorch, when I perform resizing to 64x64 and use a constant learning rate (which corresponds to case B-X), I get 72.5% accuracy, which matches the result in fast.ai (73%). So after all, resizing doesn't seem to matter that much (although to be more exact I would have to repeat these experiments multiple times and calculate mean & std values…).

In summary, there seems to be some magic happening in the optimizer; it just surprises me that this has such a large impact (89% vs. 73%!).

Is there an article or a lesson in one of fast.ai’s courses, where they explain the tricks that they apply to the optimizers (other than 1-cycle, weight decay and discriminative learning)? In any case I’m going to have a look at the source code.

2 Likes

Not really that I know of. There's not an incredible amount in the lessons about optimisers, and most of what I remember was more introductory, explaining Adam rather than particular tweaks. The latest part 2, which goes through and builds the library from scratch, had a section, but it was some new stuff on a LAMB-based optimiser, not the Adam-based one in v1, and then it went into the new scheduler ideas (as in one-cycle/cosine-annealing scheduling of parameters).

Oh, oops: if you just did create_cnn_model(models.resnet50, data_c, pretrained=True), then that will result in an unfrozen model. I forgot to add the freezing and the code to create layer groups, so that would be a deviation in those last few experiments. If you didn't pull those in, you'd need:

from fastai.vision.learner import cnn_config # Not in __all__ so need to explicitly import
meta = cnn_config(base_arch) # base arch is the function to create the model, e.g. models.resnet50
learn.split(meta['split'])
learn.freeze()

You can use learn.layer_groups to see the groups.

Looking around, there doesn't actually seem to be that much related to optimisers apart from layer groups. The main thing I found was that in torch_core you have AdamW = partial(optim.Adam, betas=(0.9,0.99)); this is the default for Learner.opt_func. Then in fastai.callback.OptimWrapper there is some more. The best summary I could find was:

    @classmethod
    def load_with_state_and_layer_group(cls, state:dict, layer_groups:Collection[nn.Module]):
        res = cls.create(state['opt_func'], state['lr'], layer_groups, wd=state['wd'], true_wd=state['true_wd'], 
                     bn_wd=state['bn_wd'])
        res._mom,res._beta = state['mom'],state['beta']
        res.load_state_dict(state['opt_state'])
        return res

So those look like the parameters it's playing with. I think opt_state is the current optimiser state rather than hyperparameters (it's a bunch of tensors), so the others would be the key ones. lr and wd you've already looked at, so true_wd and bn_wd would be the ones to look at if you haven't and are still digging.
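In case it helps, those knobs are all exposed when creating the Learner (a sketch):

# Optimiser-related arguments surfaced on Learner/cnn_learner in fastai v1.
learn = cnn_learner(data, models.resnet50, metrics=[accuracy],
                    wd=1e-2,        # weight decay
                    true_wd=True,   # decoupled (AdamW-style) weight decay
                    bn_wd=True)     # also apply weight decay to BatchNorm parameters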

Yeah, though it looks like you are doing shortish runs (for obvious reasons), so it might be a speed thing and PyTorch/TF will catch up in the end. I think the general idea in fastai is to use fairly aggressive settings and extensive regularisation to mitigate the problems with this. While I'm not very experienced in DL (and have only really used fastai apart from a little playing) I very rarely see runs go off the rails, so I guess it generally works (but I also likely haven't used any of the trickier models to train).

Thanks a lot for checking this out. I played around with these parameters (wd, true_wd, bn_wd) without any notable effect.

Yeah, though it looks like you are doing shortish runs (for obvious reasons), so it might be a speed thing and PyTorch/TF will catch up in the end.

This didn't seem to be the reason: in PyTorch/TF I also trained much longer and experimented with different learning rates, but I never got accuracies higher than 80%.


But now, I finally found out where the magic is happening:

I thought that when you create a learner using the cnn_learner() function, by default the complete pretrained CNN model would be frozen and only the appended head would be trainable.
This is NOT the case! The BatchNorm layers in the pretrained model are trainable by default! In my TF/PyTorch experiments, I trained the models with frozen BatchNorm layers. This was the reason for the different performance.

You can configure this using the train_bn parameter, when creating a new Learner:

learn = cnn_learner(data, models.resnet50, metrics=[accuracy], train_bn=False)

… This gives me only 71% accuracy after 3 epochs (compared to 89%, when train_bn=True).

There are actually threads on that.


Apparently Jeremy also mentions it in part 2 of this years course.

It seems to be absolutely crucial to not freeze the BatchNorm layers when doing CNN transfer learning!
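For anyone wanting to reproduce this outside fast.ai, something like the following should mimic the train_bn=True behaviour in plain PyTorch (a sketch, not my exact code; the 10-class head is for CIFAR-10):

import torch.nn as nn
import torchvision

model = torchvision.models.resnet50(pretrained=True)
for module in model.modules():
    if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)):
        for p in module.parameters():
            p.requires_grad = True        # keep BN weight/bias trainable
    else:
        for p in module.parameters(recurse=False):
            p.requires_grad = False       # freeze everything else in the body

model.fc = nn.Linear(model.fc.in_features, 10)  # fresh, trainable head
# Note: BN running statistics also update whenever the model is in train() mode,
# independently of requires_grad.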

One thing I still don't understand is why learner.fit(3) and learner.fit(3, lr=0.003) gave me different results, as 0.003 clearly is the default value for the learning rate… but that's for another day.

@TomB Thank you very much for your help.

8 Likes

No problem, great to see you got to the bottom of it; I was interested to know.

The default is actually slice(None, 0.003, None); this is used to create a range of values to use as LRs for the different layer groups. With only the end of the slice given, I believe it uses lr/10 (i.e. 0.0003) for the earlier layer groups and 0.003 for the last. Ideally this avoids big jumps in earlier layers, which can hurt performance, and favours learning in the later layers, notably the less fragile linear head. So I gather that even in a frozen model this would slow down batchnorm learning in earlier groups, making them slower to update to the new distribution of inputs and resulting activations in transfer learning.

3 Likes

The default is actually slice(None, 0.003, None); this is used to create a range of values to use as LRs for the different layer groups. With only the end of the slice given, I believe it uses lr/10 (i.e. 0.0003) for the earlier layer groups and 0.003 for the last. Ideally this avoids big jumps in earlier layers, which can hurt performance, and favours learning in the later layers, notably the less fragile linear head. So I gather that even in a frozen model this would slow down batchnorm learning in earlier groups, making them slower to update to the new distribution of inputs and resulting activations in transfer learning.

That makes sense, thanks for clarifying. I just noted that when I train the model using a smaller, constant learning rate (e.g. learner.fit(3, lr=0.0008)), I get better results. So apparently 0.003 for all layers in the head was too big.

Hi, I'm trying your code and somehow it throws an error when applying GlobalAveragePooling2D to base_model.output. It's a shape error. Didn't you have this problem?

Hi teoddor,
No, I’ve never encountered this error during my experiments. Which version of TensorFlow are you using? If you shared your code it might be easier to reproduce the issue.