When fine tuning the LM it is said that - “We first tune the last embedding layer so that the missing tokens initialized with mean weights get tuned properly. So we freeze everything except the last layer.”
In the code this is done with the following line of code:
According to my understanding learner.freeze_to(-1) means unfreezing the top most layer which is not the embeddings layer. The embeddings layer is the bottom most or the first layer so I would expect to see learner.freeze_to(0).
Appreciate if you you can clarify this?