Is it correct that if I set up a transfer learning model, train the learner for 4 epochs, unfreeze it, and train it for another 4 epochs, the head of the network has been updated 4 times and the new layers have been updated 8 times? This is just a sanity check really, I’m pretty sure that’s right. Is there a way of finding out based on the learner object itself? I looked in the docs (and here on the forum) but wasn’t able to find a clear answer.
Not quite.
If you train for 4 epochs, that means that you have shown every piece of your data 4 times.
The number of updates is how often you actually adjust the weights, which happens after every iteration (using `batch_size` items at once). So your head will have `8 * it_per_epoch` updates and your backbone will have `4 * it_per_epoch` updates.
Also note, the new layers on top are called the model head. The transferred part is the backbone.
You can check your progress bars to see the number of iterations (the lower progress bar shows iterations; the upper one shows your epochs).
You can also compute it this way:
`iterations_per_epoch = num_items / batch_size`
`num_updates = num_epochs * iterations_per_epoch`
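For a concrete example of that arithmetic (the numbers below are made up, just to illustrate):

```python
import math

# Hypothetical numbers, just to illustrate the arithmetic
num_items = 10_000          # e.g. len(data.train_ds)
batch_size = 64
frozen_epochs = 4           # head-only training
unfrozen_epochs = 4         # training after learn.unfreeze()

it_per_epoch = math.ceil(num_items / batch_size)                  # 157
head_updates = (frozen_epochs + unfrozen_epochs) * it_per_epoch   # 1256 (all 8 epochs)
backbone_updates = unfrozen_epochs * it_per_epoch                 # 628  (last 4 epochs only)
print(it_per_epoch, head_updates, backbone_updates)
```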
Hope this helps
When you create a new pretrained learner with `cnn_learner`, it will by default be frozen (see this line in the `__init__`: `if pretrained: learn.freeze()`), meaning that all but the head of the network (which is created automatically with the right parameters for your dataset) will be set to `set_trainable(l, False)`.
This means that the whole pretrained part of your network won’t be updated during training. But when you unfreeze your network, all your layers will be set to `set_trainable(l, True)`, so now they are all updated (which is why training is now slower).
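If you are curious what freezing boils down to underneath, here is a minimal sketch in plain PyTorch; the backbone/head split and the layers are just placeholders for illustration, not fastai’s actual internals:

```python
import torch.nn as nn

# Toy model: a "backbone" followed by a new "head" (purely illustrative layers)
model = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU()),                            # stand-in backbone
    nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)),  # stand-in head
)

def freeze_backbone(m, frozen=True):
    # Freezing just means these parameters stop receiving gradient updates
    for p in m[0].parameters():
        p.requires_grad = not frozen

freeze_backbone(model, frozen=True)   # like learn.freeze(): only the head will train
freeze_backbone(model, frozen=False)  # like learn.unfreeze(): every layer trains again
```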
So to answer your question, the head of your network will be updated during all 8 epochs (and not just 4 times, as @xeTaiz explained), and the pretrained part will be updated during 4 epochs.
Aha! So it seems like I was right about what’s happening with the freezing & unfreezing, but I totally misunderstood what an epoch is. Thanks!
So is it correct that an epoch is:
- Take a subset of `batch_size` items from the training set
- Put that subset through the various layers to end up with a prediction for each item
- Use those predictions to calculate the loss function for each item
- Update all the parameters according to their gradient with respect to that loss, ~~minus~~ times the learning rate <-- thanks @NathanHub
- GOTO 1 until all items in the training set have been processed (i.e. `len(data.train_ds) / batch_size` iterations)
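Sketching my mental model of one epoch as a plain PyTorch-style loop (just an illustration that assumes `model`, `loss_func`, `opt` and `train_dl` already exist; the real fastai loop adds callbacks, the one-cycle schedule, etc.):

```python
def one_epoch(model, loss_func, opt, train_dl):
    model.train()
    for xb, yb in train_dl:          # one iteration per mini-batch of batch_size items
        preds = model(xb)            # forward pass: a prediction per item
        loss = loss_func(preds, yb)  # loss for this mini-batch
        loss.backward()              # gradients w.r.t. every trainable parameter
        opt.step()                   # one weight update -> happens len(train_dl) times
        opt.zero_grad()
```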
So strictly speaking, by the end of a call to `fit_one_cycle` your weights have been updated `num_epochs * (num_items / batch_size)` times rather than just `num_epochs` times. That’s really great (and important!) to know, especially when it comes to thinking about performance implications. Thank you so much for the clarification.
So I guess I would’ve been less wrong if I had phrased my initial question as “the backbone of the network has been trained for 4 epochs and the head has been trained for 8 epochs”, is that right?
Thank you both for your detailed replies, very helpful!
And thanks for the “head” vs “backbone” distinction, that will also make things much easier
Typing this “out loud” it makes me wonder - does the minibatcher select items with replacement or not? I’ll dig into the source and see if I can find out.
Yes, it seems that you’ve got it now!
Just to remind you, this is how each weight is updated: `weight = weight - learning_rate * gradient`
So I would maybe just rephrase this as: ‘Update all the parameters according to their gradient with respect to that loss times the learning rate’
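In code, a bare-bones version of that update (plain SGD, no momentum or weight decay, and assuming a `model` and a `learning_rate` already exist) would be:

```python
import torch

with torch.no_grad():  # update the weights in place, outside of autograd
    for p in model.parameters():
        if p.grad is not None:
            p -= learning_rate * p.grad   # weight = weight - learning_rate * gradient
            p.grad.zero_()
```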
Also, if you want more information about why we only process a subset (i.e. a mini-batch) of the training set at each step, take a look at this blog post. Basically it is to take advantage of the vectorization capabilities of your GPU and to be computationally efficient.
Thanks for fine-tuning my mental model there.
And thanks for the blog post. I understand all the benefits of mini-batch training - I guess I’d just never really joined the dots about batches vs. epochs.
I’m glad I asked and even more glad you & @xeTaiz responded, thanks again.
Perfectly correct now!
> So I guess I would’ve been less wrong if I had phrased my initial question as “the backbone of the network has been trained for 4 epochs and the head has been trained for 8 epochs”, is that right?
Yep
> Typing this “out loud” it makes me wonder - does the minibatcher select items with replacement or not? I’ll dig into the source and see if I can find out.
In general, without replacement. Basically what it does is take a list of indices to all your items, `range(len(items))`, and shuffle it, then pick `batch_size` of them at a time. If you picked with replacement, you could not ensure that all of your training data was actually shown to the network.
However, note that this might still happen if `len(items) / batch_size` is fractional, meaning your last batch would have to be smaller than `batch_size` because there are no more items left. I’m honestly not quite sure how this is handled for images in fastai / PyTorch.
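If I remember correctly, a plain PyTorch `DataLoader(..., shuffle=True)` does exactly this: it reshuffles the indices each epoch (sampling without replacement), and the `drop_last` flag decides what happens to that smaller final batch. I can’t speak for fastai’s defaults, but here is a quick sketch to see it:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(10).float())  # 10 items with batch_size=4 -> 4, 4, 2
keep_last = DataLoader(ds, batch_size=4, shuffle=True, drop_last=False)
drop_last = DataLoader(ds, batch_size=4, shuffle=True, drop_last=True)

print([len(xb) for (xb,) in keep_last])  # [4, 4, 2] -- the short final batch is kept
print([len(xb) for (xb,) in drop_last])  # [4, 4]    -- the short final batch is dropped
```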