im in the third lesson now and im confused on what will freeze and unfreeze do and when should we use them? Thank you <3
freeze and unfreeze effectively let you decide which specific layers of your model you want to train at a given time (I believe this works by setting each parameter's requires_grad flag to False to turn off training for that layer). This matters because we often use transfer learning: the early layers of a pretrained model are already well trained at what they do, recognizing basic lines, patterns, gradients, etc., but the later layers (which are more specific to our exact task, like identifying an animal breed) will need more training.
unfreeze will unfreeze all layers of your model, so you will be training the early and later layers, although you still may be training the different layer groups at different learning rates. This is called ‘discriminative learning rates’ or ‘discriminative layer training’.
freeze will set all of your layer groups except the last one to be untrainable. It appears from the documentation that this means we freeze the first layer group (the one that comes from transfer learning) and unfreeze the second (also last) group, to train more.
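To make the above concrete, here is a toy sketch in plain Python (the class and function names are my own for illustration, not the actual fastai source) of what freezing conceptually does: each parameter carries a requires_grad flag, and the optimizer skips any parameter where it is False.

```python
# Toy model of freeze/unfreeze semantics. In a real framework,
# Param would be a tensor and the optimizer would skip parameters
# whose requires_grad flag is False.

class Param:
    def __init__(self, name):
        self.name = name
        self.requires_grad = True  # trainable by default

def freeze(layer_groups):
    """Freeze every layer group except the last one."""
    for group in layer_groups[:-1]:
        for p in group:
            p.requires_grad = False
    for p in layer_groups[-1]:
        p.requires_grad = True

def unfreeze(layer_groups):
    """Make all layer groups trainable again."""
    for group in layer_groups:
        for p in group:
            p.requires_grad = True

# Earliest group first, head (task-specific) group last.
groups = [[Param("early_convs")], [Param("mid_convs")], [Param("head")]]

freeze(groups)
print([p.requires_grad for g in groups for p in g])   # [False, False, True]

unfreeze(groups)
print([p.requires_grad for g in groups for p in g])   # [True, True, True]
```

This mirrors the typical fine-tuning loop: train with the body frozen first so only the new head learns, then unfreeze and train everything, usually with lower learning rates on the earlier groups.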
If you know the details of your architecture and want to do something in between freeze and unfreeze, you can use freeze_to(n:int) to specify which layer groups you want to freeze and which you want to train: the first n layer groups will be frozen and the remaining groups will stay trainable.
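A small sketch of that behavior, again with made-up names rather than the real fastai internals, assuming freeze_to(n) freezes everything before group index n:

```python
# Toy sketch of freeze_to: freeze the first n layer groups,
# leave groups from index n onward trainable.

class Param:
    def __init__(self, name):
        self.name = name
        self.requires_grad = True  # trainable by default

def freeze_to(layer_groups, n):
    # Parameters in the first n groups stop receiving gradient
    # updates; groups from index n onward remain trainable.
    for i, group in enumerate(layer_groups):
        for p in group:
            p.requires_grad = i >= n

groups = [[Param("early")], [Param("middle")], [Param("head")]]

freeze_to(groups, 2)   # freeze the first two groups
print([p.requires_grad for g in groups for p in g])   # [False, False, True]

freeze_to(groups, 0)   # same effect as unfreezing everything
print([p.requires_grad for g in groups for p in g])   # [True, True, True]
```

Under this reading, freeze is just freeze_to(number_of_groups - 1) and unfreeze is freeze_to(0).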
Maybe someone more experienced than me can help us out and answer.
- What are some best practices/common patterns for using freeze/unfreeze with and without transfer learning?
- What are the default settings? When I run a few epochs initially is it on all layer groups? Or just the last one? What about after I finish when I go to train some more?
Hope that helps!
For visual explanation and a bit more, Jeremy explains where and when freeze/unfreeze is applied for a typical fine-tuning/transfer-learning CNN model in fast.ai through the end of Lesson 4 and start of Lesson 5.
Same concept is applied to other types of models too.
The following links are lesson notes from Hiromi, and they also contain links to the relevant parts in the lesson videos.
Sometimes I like to move beyond the practical explanation of what a function or technique, such as freezing and unfreezing, does and how it works technically, toward a broader intuition for why it works in principle. While I now understand from the explanations throughout the forum that the weights in earlier, pre-trained layers are being updated based on information used to train the latest layer, my colleagues and I were looking for an intuitive explanation that someone without knowledge of neural nets might be able to relate to. To this end we came up with the following human learning analogy that resonates with us, but I'd be interested if anyone might want to critique it or provide us with a better one. The analogy is as follows:
We accumulate knowledge over time as we observe and learn, and what we learn builds on what we have already learned before without having to relearn it (i.e. transfer learning). However, what we have learned in the past doesn't go unmodified by what we are currently learning: new observations and training lead to new insights that may cause us to go back and modify or adjust (i.e. unfreeze) some of what we learned earlier, potentially correcting, improving or deepening what we thought we knew, effectively seeing some of it with new eyes (i.e. fine-tuning some of the weights in earlier layers, thus improving those earlier layers' outputs). This improved understanding of what I learned earlier informs and improves my ability to learn new things (i.e. it gives the new layer that I am training better inputs to start with).
I can relate this to how I learned by rote in my geometry/trig class that the circumference of a circle is 2*pi*r, the area of a circle is pi*r^2, and the volume of a sphere is 3/4*pi*r^3. However, when I took calculus the next year, I suddenly understood how they are all related, how they build on each other, and thus how they can be derived from the ground up using integral calculus. This deepened my understanding of what I had already learned, moved me beyond rote learning so that it wasn't so hard to remember the formulas, and maybe even allowed me to correct some formulas that I had gotten wrong to begin with.
Is this a reasonable analogy, or does someone have a better one? Perhaps this is obvious, but I do like thinking of ways to relate these things to how my brain works. Thanks for humoring me.
Oops. Just noticed that my “volume of a sphere” formula contains a typo: should be 4/3 * pi * r^3. I suppose this is a case in point - that earlier layer of mine needs a little unfreezing and retraining.
@cmvandam That's a really well worded write-up. But in my opinion your analogy and the geometry/trig example miss the reason why we go back to unfreeze and retrain the initial layers. We do that when our earlier layer groups don't perform (recognize shapes for our particular dataset) as well as we want them to. So we are just rebuilding our basic understanding of shapes to suit our particular dataset.
If you want to think of this in terms of a real world analogy, how we sometimes struggle with a new mobile phone when we switch brands probably comes somewhat close. We might know the general workings of the phone, but how we change settings for a particular function could be different in this new phone from how it was in our older, different-branded phone. So we dig deeper to find that setting which is different in the new phone.
Hope I am not confusing you.
Thank you very much, qsa007. Your explanation and alternative analogy are indeed very helpful. The issue I think you are pointing out with the geometry/trig analogy is that it implies that the earlier learning was somehow less complete or less correct than after it was unfrozen and retrained with the additional data. You are pointing out that this is not what is really going on. In fact, the unfrozen and then retrained layers aren't necessarily any better per se, just better adapted to the problem at hand; i.e., not everything I learned about my iPhone, especially the most specialized learning that typically happens later in the learning process, will be relevant, and some of it may even be somewhat misleading. I see now more clearly how not just unfreezing, but potentially even discarding, some of the more specialized / later layers of the model we are building on can therefore be helpful. Thanks for taking the time to respond and help me out.
I think riding a bike is a good analogy (assuming, of course, you have already learned how to do it). When you buy a new bike, you first need to adapt to riding it (it's a different one), and after you do, you can begin serious training: just riding it without thinking about how to steer it.