Transfer Learning in fastai - How does the magic work?

I’ve recently started experimenting with fastai. I was shocked to find that in all experiments I conducted, it significantly outperformed my Tensorflow (2.0) models, despite using the same model architectures, optimizers and loss functions.

Since Tensorflow doesn’t offer features such as the 1-cycle policy, weight decay or the fancy data transformations out of the box, I disabled all of these features in fastai to get a fairer comparison, but fastai still performs better, no matter what I do.

Here are the results from an experiment I conducted today:


  • Dataset: CIFAR-10
  • Model: Resnet50 (pretrained on ImageNet)
  • Optimizer: Adam (learning-rate=0.003)
  • Batch-size: 64
  • No weight decay
  • No data transformations
  • No 1-cycle policy

Here is my fastai code:

data = ImageDataBunch.from_folder(data_path, train="train", valid="test", ds_tfms=None, size=64, num_workers=0).normalize(imagenet_stats)

learn = cnn_learner(data, models.resnet50, metrics=[accuracy], true_wd=False)

And here is my Tensorflow model:

base_model = keras.applications.resnet_v2.ResNet50V2(weights="imagenet", include_top=False)

for layer in base_model.layers:
    layer.trainable = False

avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
mx = keras.layers.GlobalMaxPooling2D()(base_model.output)
out = tf.keras.layers.Concatenate()([avg, mx])
out = keras.layers.BatchNormalization()(out)
out = keras.layers.Dropout(0.25)(out)
out = keras.layers.Dense(512, activation="relu")(out)
out = keras.layers.BatchNormalization()(out)
out = keras.layers.Dropout(0.25)(out)
out = keras.layers.Dense(nr_classes, activation="softmax")(out)

model = keras.models.Model(inputs=base_model.input, outputs=out)

optimizer = tf.keras.optimizers.Adam(learning_rate=0.003)  # `lr` is a deprecated alias
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

I’ve designed the output layers that are added after the frozen CNN layers to be identical to the ones that fastai uses (although I wasn’t sure about the dropout rate fastai uses, so I ended up using 0.25).

Results (only training the output layers, without unfreeze() !):

  • fastai: 85% after one epoch, 89% after 3 epochs
  • Tensorflow: 81% after one epoch, 80% after 3 epochs


  1. Why are the results in Tensorflow so much worse? Is it just that fastai’s pretrained weights for resnet50 are better, or am I missing some of the magic that fastai uses?

  2. In Tensorflow, I resized the images from 32x32 to 224x224, which is the image size that was used for pretraining the resnet50 model with imagenet. When I used the 32x32 images directly for training, I got really bad results (~10% accuracy). In fastai, I don’t perform any resizing, and it just works. Does fastai resize the images by default (I don’t think that it does…)? Was the resnet model trained with different image resolutions, or is there another trick being used?

Here are the notebooks I used for this experiment:


Try using the same init function? I think fastai uses kaiming_normal and keras defaults to glorot_uniform

Thanks for the hint, I forgot about comparing weight initialization - gonna try that.
However, I doubt that this will boost my TF model by 9%.
Do you happen to know how I can change the weight initialization in fastai? It seems that kaiming_normal is not available in TF/Keras.
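For reference, the two initialization schemes mentioned here differ mainly in how they scale the random weights. The following is a minimal pure-Python sketch of the textbook formulas, not the fastai or Keras implementation: Glorot uniform draws from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)), while Kaiming (He) normal draws from a normal distribution with std = sqrt(2 / fan_in). The function names and the example layer shape are assumptions for illustration (a 4096-to-512 dense layer, as in the head of the Keras model above if the backbone outputs 2048 channels that are doubled by the avg/max concatenation). As far as I know, Keras offers the Kaiming scheme under the name he_normal.

```python
import math

def glorot_uniform_std(fan_in, fan_out):
    """Std of Glorot/Xavier uniform init: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return limit / math.sqrt(3.0)  # std of a uniform distribution on [-limit, limit]

def kaiming_normal_std(fan_in):
    """Std of Kaiming/He normal init in fan_in mode with the ReLU gain sqrt(2)."""
    return math.sqrt(2.0 / fan_in)

# Hypothetical layer shape, chosen only for illustration.
fan_in, fan_out = 4096, 512
print(f"glorot_uniform std: {glorot_uniform_std(fan_in, fan_out):.5f}")
print(f"kaiming_normal std: {kaiming_normal_std(fan_in):.5f}")
```

For this layer shape the two standard deviations differ by roughly 6%, so the sketch on its own says little about whether initialization could explain a gap of several accuracy points.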

@nkaenzig - fastai is essentially a wrapper for pytorch. If you can’t find a way to change the weights using fastai’s api (I haven’t gone through the entire documentation), you can alternatively use pytorch.

Here’s a reference. The commands below look at the existing weights of one of the layers in my model. The next line reinitializes them using kaiming normal. You can probably change it to something else based on pytorch’s documentation.


Thanks for the reply. I’m gonna do an implementation in pure pytorch, and try out the same weight initialization as in keras.


That size parameter is not the batch size, as I think you intend; rather, it is the image size, so you are actually resizing to 64 (longest dimension, fastai takes single dimension sizes). I’d have to dig to confirm, but I think this will make it upsize with the default transforms rather than just not crop. Batch size (the bs parameter) is being left at its default, which is also 64, hence this is hard to spot.
As you seem pretty capable in TF, I would note that ImageDataBunch.from_* is mainly intended for very new learners. Using the separate methods is generally the best way to go, as it keeps steps separate and helps avoid issues like this (given the myriad different things you’re trying to provide parameters for in a single function). Looking at the source for ImageDataBunch.from_folder you can see it’s doing:

src = (ImageList.from_folder(...)

The next bit isn’t so clear from the source, but it’s like:

data = (src.transform(get_transforms(), size=...)

(showing where that size was going and where batch size would go).

There’s an init parameter to cnn_learner taking an init function.

On performance, if you haven’t already you might want to have a poke through fastai.layers, some of the magic is in well-defined units in there.


@TomB Thanks a lot for pointing this out.
Meanwhile I did an implementation in PyTorch, and got about the same results as in Tensorflow, so I concluded the cause is neither the weight initialization nor the different pretrained models. Something in fastai has to be going on.

After doing some experiments with respect to the data pipeline and the size parameter you mentioned, I finally found something interesting:

When defining the pipeline as follows (without resizing or transformations), I finally get bad results:

data = (ImageList.from_folder(data_path)
            .split_by_folder(train='train', valid='test')

But now it gets really interesting. If I just change the size parameter, I get good results (89% accuracy):

data = (ImageList.from_folder(data_path)
            .split_by_folder(train='train', valid='test')
            .transform(None, size=64)

That doesn’t make any sense to me at all. I first thought that setting the size parameter would trigger the default transformations (as you suspected). However, this doesn’t seem to be the case: When I add the transformations as follows, without the size parameter to the pipeline, I still get bad results:

Also very strange is that if I train my Tensorflow or PyTorch models, with 64x64 image shapes, I get much worse results.
So then I suspected that maybe there was a bug with the size parameter, and that in the end the model is trained with a larger size. However, the output shapes of the CNN do correspond to an input size of 64x64, so that doesn’t seem to be the cause either.

The only explanation I can imagine for this behaviour is that, if you set the size parameter, fastai pulls a different pretrained model (one that was trained on imagenet images smaller than the default 224x224) and thus works well with 64x64 images.

Does somebody have an idea what is going on?

The same pretrained model is used (for imagenet) unless you use a custom pretrained model. Consider this: is there anything we can really gather from a 64x64 image? I’d say not really if it’s something other than MNIST-type situations, it’ll just look like a random cluster of images. You’re better off looking at this comparison using a size more commonly used or “standard” within the library: something of base 8 and above 200. This is usually between 224 and 360. One thing to consider when looking at the original databunch being made is to check the size of the tensors that are generated when you call data (it’ll show something like 3x…x…), where the size comes after the 3. Does this clarify some things for you @nkaenzig?

Ah here’s another bit you may have missed. When we call cnn_learner our weights are frozen (from the pretrained model) except for the very last layer, so we train only it (and that’s why we call learn.unfreeze() at the end to use the entire thing). Are you accounting for this in Keras or unfreezing after you generated your Learner?

@muellerzr Thanks for your response. I am very aware of these two aspects that you pointed out. In all experiments I conducted, the pretrained layers were frozen, and only the newly added output layers were trained.

It also doesn’t surprise me at all that the Tensorflow and PyTorch models don’t work well when using 64x64 images for training, when they were pretrained with 224x224 images (for the reasons you mentioned).
What does surprise me a lot is that training with 64x64 images somehow works very well in fastai. This is what doesn’t make any sense to me. Something very strange seems to be happening here under the hood.


Thanks for checking for me. Fastai will by default create a head with a dropout of 0 instead of 0.25 like you specify for your Dropout in Keras (learn.summary showed zero as well).

One little difference I can see right now

No, you need to call get_transforms for default transforms to be applied,
Training at the moment so can’t run much, but poking around in the code where size ends up, there’s code that, if size is set, adds a TfmCrop and looks like it is using ResizeMethod.CROP, though less sure on that. So you should end up with that as your only transform. You should find a data.train_ds.x.tfms with the final transforms to verify (or maybe data.train_ds.tfms).
I thought perhaps it was randomising as various transforms do, effecting some data augmentation. But it doesn’t look like it, though a bit hard from static analysis. So it is actually sizing your images up to 64x64? I couldn’t quite tell how it was doing that, resizing or just repeated at the edges or something. Still seems odd that would have a large effect.

One thought was that resnet has a stride 2 conv as the first layer, so you instantly lose half the size with only that first conv able to extract information. The resize would preserve more pixels for subsequent processing. I’ve suspected this early reduction may not be ideal in some cases but haven’t done any real testing of this (I was looking at segmentation where you’d thus lose detail, I did find this to perhaps be true, but was varying architectures and implementations so mainly supposition).
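To make the downsampling argument concrete, here is a small sketch that traces the spatial size of a feature map through the strided layers of a standard ResNet-50, using the usual output-size formula floor((n + 2p - k) / s) + 1. The layer list (7x7 stride-2 stem conv, 3x3 stride-2 max pool, then three further stride-2 stages) is an assumption based on the torchvision-style architecture; other implementations may pad slightly differently.

```python
def conv_out(n, kernel, stride, padding):
    """Spatial output size of a conv/pool layer for a square input of side n."""
    return (n + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding) of the layers that reduce resolution.
# Assumed torchvision-style ResNet-50; layer1 has stride 1 and is omitted.
STRIDED_LAYERS = [
    ("stem conv 7x7", 7, 2, 3),
    ("max pool 3x3", 3, 2, 1),
    ("layer2", 3, 2, 1),
    ("layer3", 3, 2, 1),
    ("layer4", 3, 2, 1),
]

def trace_sizes(input_size):
    """Return the feature-map side length after each strided layer."""
    sizes = [input_size]
    for _name, kernel, stride, padding in STRIDED_LAYERS:
        sizes.append(conv_out(sizes[-1], kernel, stride, padding))
    return sizes

for size in (32, 64, 224):
    print(size, "->", trace_sizes(size))
```

Under these assumptions a 32x32 input ends up as a single 1x1 position before pooling, a 64x64 input as 2x2, and a 224x224 input as 7x7, which is one way to see why upsizing small images can help.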

Oh, and that still doesn’t explain the difference with TF on 64px, unless perhaps fastai is using better resizing. Seems a stretch but maybe. Running out of differences.


@muellerzr Just retrained my PyTorch model with 0 dropout, didn’t change the results at all, which I expected.

@TomB Thanks a lot for checking this out. The different resizing operation is indeed the only difference I’m seeing at the moment. Later I’m going to do some experiments using the exact same resize operation in TF/PyTorch to see what happens. But I doubt that this is the cause…

Yeah, seems unlikely to be the resize, but can’t see much else.
Did you rebuild the fastai model for PyTorch or just create with fastai and train with PyTorch? I think the latter should work fine and would definitely eliminate the model. The other thing on those lines being to use the rebuilt PyTorch model in fastai which should also work.

@TomB I did “rebuild” the model in my PyTorch experiment (in that notebook I didn’t even import the fastai library).

I just tried to use the fastai model in PyTorch, but I’m getting an error:

AttributeError: ‘ImageFolder’ object has no attribute ‘c’

Here’s the code:

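import os
import torch
from torchvision import datasets, transforms
from fastai.vision import *  # DataBunch, cnn_learner, models, accuracy, imagenet_stats

cifar10_dir = 'data/cifar10/'
BATCH_SIZE = 64  # batch size used throughout this experiment

data_transforms = {
    'train': transforms.Compose([
        transforms.ToTensor(),  # Normalize expects a tensor, not a PIL image
        transforms.Normalize(imagenet_stats[0], imagenet_stats[1])]),
    'test': transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(imagenet_stats[0], imagenet_stats[1])]),
}

image_datasets = {x: datasets.ImageFolder(os.path.join(cifar10_dir, x), data_transforms[x]) for x in ['train', 'test']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=BATCH_SIZE, shuffle=True, num_workers=4) for x in ['train', 'test']}

databunch = DataBunch(dataloaders['train'], dataloaders['test'])
# The next line raises the AttributeError quoted above
learn = cnn_learner(databunch, models.resnet50, metrics=[accuracy], true_wd=False)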

The error is thrown, when calling cnn_learner(), not sure what to do to prevent this - the DataBunch is instantiated without error.

Oh yes, sorry cnn_learner won’t work as it calls data.c to get the number of categories. You need create_cnn_model, something like:

data_c = <#categories>
model = create_cnn_model(models.resnet50, data_c, pretrained=True)

should give the same model.

I think though that trying to use the fastai learner with a PyTorch dataset will be tricky. So you might be better training with the PyTorch stuff.

You can also try a fastai learner with a PyTorch model, something like:

learn = Learner(pytorch_model, fastai_data)

should work fine, nothing special on the model end in fastai except for the creation stuff (just settings on standard PyTorch layers).


@TomB You’re right, running a PyTorch model with a fastai Learner was much easier. Just did that, and I got exactly the same results as when using the fastai model (i.e. the resnet50 model that fastai uses seems to be the same one that PyTorch uses).

Here the code I used:

import torchvision
learn = cnn_learner(data, torchvision.models.resnet50, metrics=[accuracy], true_wd=False)

So the magic happens definitely somewhere in the data pipeline… Very strange…

Either the data or the optimiser, fastai wraps the pytorch optimisers with various stuff even when not doing one cycle. Haven’t looked into this much so not sure what tricks are in there. Though given your training without resizing (and anything missed) gave the same results as PyTorch it seems like the data.
I would note that default learning rate to fit uses discriminative learning rates, so lower rates on earlier layer groups. The default value is lr=slice(None, 0.003, None) which uses 0.003 for the final layer group and I believe lr/10 for the first layer group. Passing a single value (i.e. just 0.003) disables this. However as the model is frozen there should only be one layer group active so this shouldn’t come into it.
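As a rough illustration of how a slice turns into per-layer-group learning rates, here is a standalone sketch modelled on my reading of Learner.lr_range in fastai v1. Treat the exact rules as an assumption and check the source; the function name lr_range_sketch and the three-group example are hypothetical.

```python
def even_mults(start, stop, n):
    """n values from start to stop, evenly spaced on a multiplicative scale."""
    if n == 1:
        return [stop]
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

def lr_range_sketch(lr, n_groups):
    """Expand a float or slice into one learning rate per layer group."""
    if not isinstance(lr, slice):
        # A plain number means the same rate for every layer group.
        return [lr] * n_groups
    if lr.start:
        # slice(start, stop): spread the rates geometrically across groups.
        return even_mults(lr.start, lr.stop, n_groups)
    # slice(None, stop): stop for the last group, stop / 10 for all earlier ones.
    return [lr.stop / 10] * (n_groups - 1) + [lr.stop]

print(lr_range_sketch(slice(None, 0.003, None), 3))  # default-style slice
print(lr_range_sketch(0.003, 3))                     # constant learning rate
```

If this reading is right, the default slice only affects the earlier layer groups, which are not trained while the model is frozen.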

I think PyTorch’s transforms are just a callable class, so you should be able to pass a fastai transform in. Otherwise would need to subclass to call TfmCrop, to check that.


After doing some experiments, I found that the optimizer indeed seems to have an impact.

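I just ran the following experiments, combining two optimizer configurations (A, B) with two data pipelines (X, Y):

Optimizer configurations:

A) default learning rate

B) constant learning rate (lr=0.003)

Data pipelines:

X) with size=64:

data = (ImageList.from_folder(data_path)
            .split_by_folder(train='train', valid='test')
            .transform(None, size=64)

Y) without the size parameter:

data = (ImageList.from_folder(data_path)
            .split_by_folder(train='train', valid='test')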
RESULTS (accuracy)

  • A-X: 89%
  • B-X: 73%
  • A-Y: 78%
  • B-Y: 68%

So specifying a constant learning rate (Config B) does seem to have an impact, even though the base model is frozen - which is kind of strange.

In PyTorch, when I perform resizing to 64x64 and use a constant learning rate (which corresponds to the case B-X), I get 72.5% accuracy, which matches the result in fastai (73%). So after all, resizing doesn’t seem to matter that much (although to be more exact I would have to repeat these experiments multiple times and calculate mean & std values…).
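Repeating each configuration with several seeds and reporting mean and standard deviation could be sketched as follows. The accuracy lists are made-up placeholders that only show the calculation; they are not measurements from this thread, and the names summarize and placeholder_runs are hypothetical.

```python
import statistics

def summarize(accuracies):
    """Return (mean, sample standard deviation) of a list of accuracies."""
    mean = statistics.mean(accuracies)
    # The sample std needs at least two runs; report 0.0 for a single run.
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std

# PLACEHOLDER numbers, not real results: one list of repeated runs per setup.
placeholder_runs = {
    "setup 1": [0.70, 0.72, 0.74],
    "setup 2": [0.71, 0.73, 0.72],
}

for name, accs in placeholder_runs.items():
    mean, std = summarize(accs)
    print(f"{name}: {mean:.3f} +/- {std:.3f} over {len(accs)} runs")
```

If the difference between two means is small compared to the standard deviations, a single run per configuration is not enough to tell the configurations apart.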

In summary, there seems to be some magic happening in the optimizer - it just surprises me that this has such a large impact (89% vs. 73%!)

Is there an article or a lesson in one of fastai’s courses, where they explain the tricks that they apply to the optimizers (other than 1-cycle, weight decay and discriminative learning)? In any case I’m going to have a look at the source code.


Not really that I know of. There’s not an incredible amount in the lessons about optimisers and most of what I remember was more introductory, explaining Adam, not particular tweaks. The latest part 2, which goes through and builds the library from scratch, had a section, but it was some new stuff on a Lamb based optimiser, not the Adam based one in v1, and then went into the new scheduler ideas (as in one-cycle/cosine-annealing scheduling of parameters).

Oh, oops, if you just did create_cnn_model(models.resnet50, data_c, pretrained=True) then that will result in an unfrozen model. Forgot to add the freezing and the code to create layer groups. So that would be a deviation in those last few experiments. So if you didn’t pull those you’d need:

from fastai.vision.learner import cnn_config # Not in __all__ so need to explicitly import
meta = cnn_config(base_arch) # base arch is the function to create the model, e.g. models.resnet50

You can use learn.layer_groups to see the groups.

Looking around, there doesn’t actually seem to be that much related to optimisers apart from layer groups. The main thing I found was in torch_core, where you have AdamW = partial(optim.Adam, betas=(0.9,0.99)); this is the default for Learner.opt_func. Then in fastai.callback.OptimWrapper you have some stuff. Best summary I could find was:

    def load_with_state_and_layer_group(cls, state:dict, layer_groups:Collection[nn.Module]):
        res = cls.create(state['opt_func'], state['lr'], layer_groups, wd=state['wd'], true_wd=state['true_wd'],
                     bn_wd=state['bn_wd'])
        res._mom,res._beta = state['mom'],state['beta']
        res.load_state_dict(state['opt_state'])
        return res

So those look like the params it’s playing with. I think opt_state is the current state rather than params (it’s a bunch of tensors). So the others would be the key params it’s playing with. lr and wd you’ve looked at, so true_wd and bn_wd would be the ones you might look at if you haven’t and are still digging.
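To show what the betas=(0.9, 0.99) default mentioned above changes, here is a minimal single-parameter Adam update in plain Python. It is a sketch of the textbook algorithm, not fastai’s OptimWrapper, and the helper names and gradient sequence are made up for illustration. The second beta controls how long the running average of squared gradients remembers old values, roughly 1 / (1 - beta2) steps: about 100 steps for 0.99 versus about 1000 for PyTorch’s default of 0.999.

```python
import math

def new_state():
    """Fresh optimizer state for one scalar parameter."""
    return {"t": 0, "m": 0.0, "v": 0.0}

def adam_step(param, grad, state, lr=0.003, betas=(0.9, 0.99), eps=1e-8):
    """One textbook Adam update with bias correction; returns the new parameter."""
    beta1, beta2 = betas
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

def run(betas, grads, lr=0.003):
    """Apply Adam to a parameter starting at 0.0 for a fixed gradient sequence."""
    param, state = 0.0, new_state()
    for grad in grads:
        param = adam_step(param, grad, state, lr=lr, betas=betas)
    return param

# Made-up gradient sequence: a short burst of large gradients, then small ones.
grads = [10.0] * 5 + [0.1] * 200
print("beta2=0.99 :", run((0.9, 0.99), grads))
print("beta2=0.999:", run((0.9, 0.999), grads))
```

With beta2=0.99 the optimizer forgets the early large gradients sooner, so its later steps are larger and the parameter travels further. This only illustrates the mechanism; whether it accounts for any of the gap reported in this thread is untested.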

Yeah, though looks like you are doing shortish runs (for obvious reasons), it might be a speed thing and PyTorch/TF will catch up in the end, think the general idea in fastai is to use fairly aggressive settings and use extensive regularisation to mitigate the problems with this. While I’m not very experienced in DL (and have only really used fastai apart from a little playing) I very rarely see runs go off the rails, so guess it generally works (but I also likely haven’t used any of the trickier models to train).