Because a check like this

if torch.cuda.is_available():
    return x.cuda(*args, **kwargs)

is not applied everywhere in the code. For example:

if isinstance(input_size[0], (list, tuple)):
    x = [Variable(torch.rand(1,*in_size)).cuda() for in_size in input_size]

Found these papers worth sharing

Snapshot ensembles


Snapshot ensembles are great, but FreezeOut is probably a waste of time.

Does fastai make snapshots during differential learning rate annealing?

Yes, if you add the cycle_save_name param to fit()
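For anyone wondering what you would then do with the saved snapshots: a common way to combine them is to average their predicted probabilities. Here is a minimal sketch in plain Python (not fastai code). The function name and the numbers are made up for illustration, and loading the saved weights and predicting with each snapshot is left out.

```python
def average_predictions(snapshot_probs):
    """Average class probabilities across snapshots, image by image."""
    n = len(snapshot_probs)
    return [
        [sum(class_probs) / n for class_probs in zip(*per_image)]
        for per_image in zip(*snapshot_probs)
    ]

# Hypothetical [P(cat), P(dog)] outputs of three snapshots for two images.
snapshot_probs = [
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.7, 0.3], [0.2, 0.8]],
]
ensemble = average_predictions(snapshot_probs)
```

Each entry of ensemble is the mean of the three snapshots' probabilities for that image.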



This is my understanding of precompute, unfreeze, etc. (and what I’m doing now in my jupyter notebooks after the first 2 lessons):

  1. Set up your variables
    PATH = "data/dogscats/"; arch=resnet34; sz = 224

bs=64 (batch size) is the default value in methods used below, so you don’t need to define it here.

  2. Set up your data augmentation (DA)
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)

The DA of your training images has an effect only when precompute=False in your new model (learn). But if you define your DA now (through aug_tfms), you will not need to take care of it later (i.e., when you set learn.precompute=False, cf. point 6).

  3. Format your data
    data = ImageClassifierData.from_paths(PATH, tfms=tfms)

At this point, your data (images) are formatted according to your pre-trained model (arch) and preferences (sz, DA, zoom…), and they are ready to be used.

  4. Set up your new neural network (NN)
    learn = ConvLearner.pretrained(arch, data, precompute=True)

The pretrained method creates your new NN from the arch model:

  • by keeping all layers but the last one (i.e., the output layer, which gives probabilities over the 1000 ImageNet classes)
  • and replacing that last layer with a few new layers (@jeremy will give details later in the course, I think) that end with an output layer giving probabilities over 2 classes (dogs, cats).

At its creation (i.e., when you run the code above: learn = ...) and by default, the new NN freezes the first layers (the ones from arch) and downloads the pre-trained weights of arch.
Moreover, precompute=False by default, so you must specify precompute=True if you want to change the default behavior.

What does precompute=True do? It tells your new NN (learn) to pass your data (images) only once through the arch model (minus its last layer, which was removed) using its pre-trained weights. That’s what we mean by the expression “compute the activations”. This transformation by activation of your data is done only one time and now the new values of your data can be used as inputs of the last layers of your new NN that you are about to train (cf. point 5).

Note 1: even if you have turned on data augmentation (cf. aug_tfms in point 2), it has no effect when precompute=True, because the activations of your data (images) are computed only once. So, at each new epoch of training, the values used as inputs to the last layers of your new NN are the ones computed at the first epoch.

Note 2: you are not obliged to set precompute=True, but the training of your new NN will be faster, since your data (images) are processed only once through the first layers. It is therefore useful when you start your project.

  5. Train the last layers of your new NN

Through lr_find(), you choose the best learning rate, and then you train your NN using the fit method (use 1 to 3 epochs). At this point (precompute=True and first layers frozen), only the last layers of your new NN will be trained (i.e., their weights will be updated in order to minimize the loss of the model).

  6. Improve the weights of your last layers by data augmentation and SGDR

The more data you have, the better the model you will get. Set precompute=False; then, at each new epoch of training, the activations of your augmented data (cf. point 2) will be computed afresh. You should also use stochastic gradient descent with restarts (SGDR) at this point.

  7. Improve your new NN (all layers)

At this point, only your last layers have been trained. You should now train all the layers of your new NN together. This is done by unfreezing the first layers of learn (learn.unfreeze()).

Note: before training your model again (using fit), you should use the lr_find() method again to select the best learning rate for your NN with all layers unfrozen. You should also use differential learning rates and the cycle_mult parameter.

  8. Final steps: increase sz and move to a better pre-trained model

See the jupyter notebook.
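To picture the SGDR step mentioned above, here is a minimal sketch of a cosine learning rate schedule with restarts, in plain Python. It is not fastai's implementation: the function name, the iters_per_epoch value and the exact cosine formula are assumptions, and cycle_mult is only meant to play the same role as the parameter of that name (each cycle lasts cycle_mult times as long as the previous one).

```python
import math

def sgdr_schedule(lr_max, n_cycles, cycle_len, cycle_mult=1, iters_per_epoch=4):
    """Learning rate per iteration: cosine decay from lr_max, reset at each cycle start."""
    lrs = []
    epochs_in_cycle = cycle_len
    for _ in range(n_cycles):
        n_iter = epochs_in_cycle * iters_per_epoch
        for t in range(n_iter):
            lrs.append(lr_max * 0.5 * (1 + math.cos(math.pi * t / n_iter)))
        epochs_in_cycle *= cycle_mult  # the next cycle is cycle_mult times longer
    return lrs

# Three cycles lasting 1, 2 and 4 epochs.
lrs = sgdr_schedule(lr_max=0.01, n_cycles=3, cycle_len=1, cycle_mult=2)
```

Within a cycle the rate decays smoothly, and at the start of the next cycle it jumps back to lr_max. That jump is the "restart".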


That’s a great explanation! Note that when you unfreeze, you probably want differential learning rates, so you don’t trample over the carefully tuned weights in the early layers.
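As a rough illustration of the idea (a toy sketch in plain Python, not the fastai API), differential learning rates just mean that each group of layers gets its own step size. The three groups, the gradients and the rates below are made-up values.

```python
def sgd_step(groups, grads, lrs):
    """Apply one SGD update, using a separate learning rate for each layer group."""
    return [
        [w - lr * g for w, g in zip(weights, group_grads)]
        for weights, group_grads, lr in zip(groups, grads, lrs)
    ]

# Three hypothetical layer groups: early layers, middle layers, new head.
groups = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
grads = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # identical gradients, for comparison
lrs = [1e-4, 1e-3, 1e-2]                      # smallest rate for the earliest layers

updated = sgd_step(groups, grads, lrs)
changes = [abs(new[0] - old[0]) for new, old in zip(updated, groups)]
```

With the same gradient everywhere, the early-layer weights move 100 times less than the head's weights, so the pretrained weights are only nudged.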


Thanks to everyone for the insightful comments.
I still don’t get why data augmentation can’t be used with precompute=True.
Why can’t the library precompute the activations for the augmented images? Do they change in every epoch? If so, why does this help with generalization?


Let’s see if I can explain this well.

precompute=True doesn’t really help with generalization. It shortens training time because the library calculates, once and in advance, the activations of the portion that you are not training. In other words, if the layers are frozen, their weights will not change during training. So given the same input, no matter how many times you calculate the activations, the result will be the same. Pre-computing that portion once and reusing those values saves you time.
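Here is a toy sketch of that saving, in plain Python rather than fastai. frozen_body is a made-up stand-in for the frozen pretrained layers, and the training of the head is left as a comment. The only thing it measures is how often the frozen part runs.

```python
import random

def frozen_body(image):
    """Stand-in for the frozen pretrained layers: a fixed, deterministic function."""
    frozen_body.calls += 1
    return [sum(image) / len(image), max(image)]

frozen_body.calls = 0

def train(images, epochs, precompute):
    """Return how many times the frozen body ran during training."""
    frozen_body.calls = 0
    cache = [frozen_body(img) for img in images] if precompute else None
    for _ in range(epochs):
        for i, img in enumerate(images):
            features = cache[i] if precompute else frozen_body(img)
            # ... the trainable last layers would be updated from `features` here ...
    return frozen_body.calls

images = [[random.random() for _ in range(8)] for _ in range(10)]
calls_precomputed = train(images, epochs=3, precompute=True)   # 10 forward passes
calls_recomputed = train(images, epochs=3, precompute=False)   # 30 forward passes
```

With 10 images and 3 epochs, the frozen part runs 10 times when the activations are cached and 30 times when they are not, and the gap grows with every extra epoch.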

It is not impossible to pre-compute the activations for augmented images if you save the augmented images somewhere and reuse them every time. But libraries apply fresh random transformations to the images on each pass, so there is no fixed set of inputs to pre-compute.

Does that help?


Yes they do. If they didn’t, it wouldn’t help with generalization - the whole point is to have maximum variety in the data.


Hello @stathis,

I hope the following explanation helps. Please tell me if it does not.

When you have precompute=True, your model calculates only once the activations from the input layer to the output of the pre-trained model (the arch model without its last layer, which was removed).
Then, when you train your new model, these activations, not the initial data (images), are the inputs used to train your last layers.
Therefore, even if data augmentation is on, it has no effect, since at each epoch your new model uses the activations calculated at the first epoch as inputs to the last layers. It does not compute these activations again, even if your input images are modified through the data augmentation process.

For data augmentation of your input images to take effect, you must have precompute=False. In this case, at each new epoch your model takes your new images (processed by the data augmentation algorithm) and calculates their activations through the first layers (the ones of the arch model without its last layer, which was removed).
This helps generalization because it is as if you had more data (images) to train your model.
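A small sketch of the difference, in plain Python (not fastai code). The augment function, the four-pixel "image" and the weights inside frozen_body are all invented for illustration.

```python
import random

def augment(image, rng):
    """Hypothetical augmentation: random horizontal flip plus a small brightness shift."""
    out = image[::-1] if rng.random() < 0.5 else list(image)
    shift = rng.uniform(-0.1, 0.1)
    return [p + shift for p in out]

def frozen_body(image):
    """Stand-in for the frozen pretrained layers (deterministic)."""
    return round(sum(w * p for w, p in zip((0.1, 0.2, 0.3, 0.4), image)), 6)

rng = random.Random(0)
image = [0.2, 0.4, 0.6, 0.8]

# precompute=True: the activation is computed once and reused at every epoch.
cached = frozen_body(image)
precomputed_inputs = [cached for _ in range(5)]

# precompute=False: each epoch draws a new random variant and recomputes.
fresh_inputs = [frozen_body(augment(image, rng)) for _ in range(5)]
```

The cached list holds the same value five times, so the last layers never see the augmented variants. The recomputed list changes from epoch to epoch, which is where the extra variety comes from.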


Question on precomputed activations and test data:

Say you have saved weights, after training all the conv layers earlier, and you restart a notebook.

Is there a way to precompute activations for the test data? When you initialize a learner, it pulls its weights from the downloaded model and precomputes its activations at init time. Loading weights afterward would throw all that away, no?

Again, this only applies if you trained the conv layers via learn.unfreeze() at some point before saving the weights / learner.

I believe that test set activations are precomputed as necessary and stored also.

Interesting. Is there something else going on when I init. the learner? On my workstation it’ll take a few minutes to precompute the activations, but I don’t notice any time-lag like that when I load weights.

I suppose it’s easy enough to test without knowing the library intimately; the time difference with/without precomputation is noticeable.

If, after all, we train all the layers (first and last) together, then why did we train only the last layers (using precompute=True and freezing the first layers) at first?


Because those layer weights start off as random whereas the weights in the pretrained model are already pretty good.

By freezing the layers of the pretrained model, we allow the new layers’ weights to get pretty good as well before unfreezing and attempting to make minor improvements to all of them.

If you tried to train all the layers to begin with, the random weights the new layers start out with may throw the pretrained weights out of whack.


I didn’t understand it.
Also, how many old and new layers would there be?
When precompute=True,
will the activations for the frozen layers be calculated from, say, the first data, and then be used as they are for the rest of the data?

To simplify things a bit, there are millions of parameters in a convolutional neural network. Training a model is the process of updating them to get the outcome you want.

We are using a pre-trained model where those weights have already been figured out for image classification, whereas we are appending new layers, with new weights/parameters, for our task at hand (e.g., identifying dog breeds or whether an image is a picture of a dog or cat). Those new layers will have a bunch of parameters of their own that start off as random.

If we then start by training the whole network, those randomly initialized weights are going to throw off the pre-trained weights as the process tries to adjust every weight in the network to best predict whatever you are trying to predict. That is why we train the last layers first and then, and only then, see if we can make very minor changes to the already pre-trained weights.
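Here is a toy version of that two-stage schedule in plain Python (not fastai code). The "network" is just two numbers, a pretrained body and a new head, and the head starts at a fixed bad value instead of a random one so the run is repeatable. All names and values are made up.

```python
def train(params, trainable, lr, steps, x=1.0, target=2.0):
    """Gradient descent on loss = (head * body * x - target)**2, skipping frozen params."""
    p = dict(params)
    for _ in range(steps):
        err = p["head"] * p["body"] * x - target
        grads = {"body": 2 * err * p["head"] * x, "head": 2 * err * p["body"] * x}
        for name in trainable:
            p[name] -= lr * grads[name]
    return p

def loss(p, x=1.0, target=2.0):
    return (p["head"] * p["body"] * x - target) ** 2

start = {"body": 1.0, "head": 0.1}  # "pretrained" body, poorly initialised head

# Stage 1: body frozen, only the new head is trained.
stage1 = train(start, trainable=["head"], lr=0.1, steps=20)

# Stage 2: unfreeze and fine-tune everything with a much smaller learning rate.
stage2 = train(stage1, trainable=["body", "head"], lr=0.01, steps=20)
```

During stage 1 the body does not move at all while the head does the catching up. By the time everything is unfrozen, the error is already small, so the pretrained weight only receives tiny updates.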

If you want to see what the model looks like with precompute=True, run this in a cell: learn. And yes, the precomputed activations will be the input to your model. To see how the model differs, try it with precompute=False and then run learn again.


Thank you for the amazing previous explanations.

I’ve two remaining questions:

  1. When it comes to “weights” vs “activations” are they one and the same? Do weights create activations?

  2. When you say “This transformation by activation of your data is done only one time and now the new values of your data can be used as inputs of the last layers of your new NN that you are about to train” – what transformations are you referring to?

Hey, curious why data augmentation has to be impacted by whether or not we use precomputed activations. From what I’ve gathered, the point of data augmentation is to create more data (by cropping, zooming in, flipping, etc.) so that our network is less biased. So it seems like augmented data should be treated the same as the original images.

Why, when precompute=True, can’t we use the precomputed activations to train our augmented images the same way we do with the originals? You say it’s because the activation of our data is computed only one time, so why not that one time be with the precomputed activations?
