So, in transfer learning we start from a function that is already optimized for a specific task (say resnet34), with all its weights and biases. That means the loss for that architecture is stuck in a local minimum. How come that after removing the ‘head’ of the resnet34 model (the final layer that maps to its 200 class outputs), adding a new layer with two outputs (cats vs dogs), randomly initializing just that new layer, freezing all the other weights and biases (the whole resnet34 body minus the old head), training for one epoch to update only the new head’s weights, and then training for ‘at least’ one more epoch with everything unfrozen (resnet34 - resnet34_head + dogs_vs_cats_head), we get a state-of-the-art result in accuracy?
I think in the second lecture Jeremy talks about Zeiler and Fergus and their 2013 paper (“Visualizing and Understanding Convolutional Networks”). Transfer learning must have been around since then. There sure isn’t much literature about its origin, and Jeremy said it wasn’t his idea.
When we fine-tune a model for a different task, what we are really interested in is the knowledge it gained from the previous task, because most images share the same basic building blocks.
We don’t care about things like its accuracy or, in your case, its loss on the previous task. I don’t think those are transferable from one network to another.
After creating a new, randomly initialized linear classifier layer for our specific task, we start with a new loss function and metrics. Remember, the loss function is a measure of how well our model is doing on our specific task.
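Concretely, for a new 2-class head the loss is just cross-entropy on the new task's labels; the old task's loss never enters the picture. A minimal sketch (the logits and labels are made-up values):

```python
import torch
import torch.nn.functional as F

# Hypothetical raw outputs of the new 2-way head for two images,
# and their labels for the new task (cat = 0, dog = 1).
logits = torch.tensor([[2.0, 0.5],
                       [0.1, 1.5]])
targets = torch.tensor([0, 1])

# A single scalar measuring how wrong the new head's predictions are;
# for a freshly randomized head this starts out high and drops as we train.
loss = F.cross_entropy(logits, targets)
print(loss)
```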
What is transferable from the initial task to your new task is the earlier layers, which learn fundamental shapes and lines.