I’m in the part of the course where we discuss transfer learning (taking off the last layer of a pretrained model and adding a new layer with random weights). I’m trying to grasp how that works. The prior layers already have trained parameters, but we don’t use those and instead use random ones. When we train the new layer for an epoch, do all the parameters come together afterwards so that it starts training on all of them? Just trying to wrap my mind around the math. Are the dimensions of the prior parameter tensors the same as those of the new layer? Thank you in advance for the help.

Hi, this is how I understand transfer learning:

First we train a model for a different purpose, e.g. a language model, so all layers learn meaningful parameters that capture the meaning of words, grammar and word dependencies. Then we remove the last layer (the decoder), which predicts the output (the next word in a sentence for an LM), and instead add a new layer that predicts e.g. the class of a document in text classification.

Thus, all parameters in the model remain the same, except that the last layer is removed and replaced with a new, randomly initialized final layer. When we then train for an epoch, all parameters are updated at the same time, unless you use gradual unfreezing, which means that you first freeze everything except the new final layer and optimize its parameters alone, then unfreeze the earlier layers step by step.
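To make the freezing part concrete, here is a minimal PyTorch sketch of gradual unfreezing. The three `nn.Linear` layers are only a toy stand-in for a pretrained network, and `freeze_up_to` and `trainable_count` are hypothetical helpers written for this example, not a library API.

```python
import torch.nn as nn

# Toy stand-in for a pretrained network, split into layer groups (assumed sizes).
layer_groups = [
    nn.Linear(20, 64),  # early pretrained layers
    nn.Linear(64, 32),  # later pretrained layers
    nn.Linear(32, 3),   # new, randomly initialized final layer
]

def freeze_up_to(groups, n):
    """Freeze groups[:n]; leave groups[n:] trainable."""
    for i, group in enumerate(groups):
        for p in group.parameters():
            p.requires_grad = i >= n

def trainable_count(groups):
    """Number of scalar parameters that will receive gradient updates."""
    return sum(p.numel() for g in groups for p in g.parameters() if p.requires_grad)

# Stage 1: only the new final layer trains.
# Stage 2: the last two groups train.
# Stage 3: everything trains.
stages = []
for n in (2, 1, 0):
    freeze_up_to(layer_groups, n)
    stages.append(trainable_count(layer_groups))
    # ... you would train for an epoch or so here before unfreezing further ...
```

The number of trainable parameters grows at each stage, while the total number of parameters in the model never changes.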

The tensor dimensions remain the same in all layers except the final layer, as the final layer is a completely new layer.

Thank you for your help! So would the total parameters of the new model be the activations of all the layers except the last, plus the new random parameters of the new layer?

yes exactly!

But the activations are determined by the weights and biases, so it is those weights and biases that are the parameters to count. And in the case of NLP, the word embeddings are also parameters that are included.
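A small PyTorch sketch of the difference, in case it helps: parameters (embeddings, weights, biases) are stored in the model and have a fixed count, while activations are computed from them for each input. The layer sizes and the vocabulary size of 100 are made up for the example.

```python
import torch
import torch.nn as nn

# Tiny made-up NLP-style model: embedding followed by two linear layers.
model = nn.Sequential(
    nn.Embedding(100, 8),  # 100 * 8 = 800 embedding parameters
    nn.Linear(8, 4),       # 8 * 4 weights + 4 biases = 36
    nn.ReLU(),             # no parameters
    nn.Linear(4, 2),       # 4 * 2 weights + 2 biases = 10
)

# Parameters are what the optimizer updates; the count does not depend on the input.
n_params = sum(p.numel() for p in model.parameters())

# Activations are computed from the parameters and the input,
# so their shape follows the batch size.
small = model(torch.randint(0, 100, (1, 5)))   # 1 sequence of 5 token ids
large = model(torch.randint(0, 100, (16, 5)))  # 16 sequences of 5 token ids
```

Here `n_params` stays the same for both calls, while the output activations go from shape `(1, 5, 2)` to `(16, 5, 2)`.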

Thank you! I’ve done the course once, so I’m re-reviewing things I don’t fully understand. It’s so amazing how this course demystified deep learning.

That is my understanding:

Let’s say you have a model that was already trained on images with 1000 classes. The model has many layers and, let’s say, ends with 512 activations, so its last layer is a 512x1000 parameter matrix that spits out a vector of 1000 activations (the same as the number of classes you predict).

When you have a new task, e.g. classifying images into 10 classes, you take that pretrained model, throw away the 512x1000 matrix and create a new 512x10 parameter matrix instead (initially random). At first you train just that new matrix while the other parameters are frozen. Then you can “fine-tune”: unfreeze the other parameters and train the whole thing again.
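Roughly, in PyTorch terms, that could look like the sketch below. The `body` here is just a small made-up stand-in for the pretrained layers (a real one would be loaded with trained weights), and the input size is arbitrary. Note that PyTorch stores an `nn.Linear` weight as (outputs, inputs), so the 512x10 matrix shows up with shape `(10, 512)`.

```python
import torch
import torch.nn as nn

# Made-up stand-in for the pretrained layers that end in 512 activations.
body = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 512), nn.ReLU())

old_head = nn.Linear(512, 1000)  # the 512x1000 layer we throw away
new_head = nn.Linear(512, 10)    # new 512x10 layer, randomly initialized

# Step 1: freeze the pretrained body so only the new head is trained.
for p in body.parameters():
    p.requires_grad = False

model = nn.Sequential(body, new_head)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]

# A batch of 4 fake images goes in, 10 class scores per image come out.
out = model(torch.randn(4, 3, 8, 8))

# Step 2 ("fine-tune"): unfreeze the body and train the whole thing again.
for p in body.parameters():
    p.requires_grad = True
all_trainable = all(p.requires_grad for p in model.parameters())
```

The body keeps exactly the same tensor shapes before and after the swap; only the head changes from 1000 outputs to 10.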