When we train an RNN, for example for translation, using precalculated word2vec embeddings, do these embeddings change their values during training, or are their values frozen?
Initially they do not change. But in the end, if we have a large dataset and sufficient compute, we can fine-tune them as well.
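In PyTorch, for example, that choice is a single flag when loading the pretrained matrix. A minimal sketch (the vectors here are random stand-ins for real word2vec output):

```python
import torch
import torch.nn as nn

# Stand-in for pretrained word2vec vectors:
# 1000-word vocabulary, 300-dimensional embeddings.
pretrained = torch.randn(1000, 300)

# freeze=True (the default) keeps the embedding values fixed during training.
frozen_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)
assert frozen_emb.weight.requires_grad is False

# freeze=False lets the optimizer fine-tune the embeddings
# along with the rest of the network.
tuned_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
assert tuned_emb.weight.requires_grad is True
```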
I think the answer depends on how the network is configured, and whether fine-tuning the embeddings is a good idea depends on your goal. For translation I do not have a general answer on what the best strategy is, so this alone is not very helpful.
When using a pre-trained model that you want to fine-tune for a very specific domain, it can make sense to update the embeddings but not the RNNs: many words may not have occurred frequently enough in the original training set, while the sentence structure is likely similar, so keeping the recurrent layers frozen seems like a good idea.
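A minimal PyTorch sketch of that strategy, with illustrative module names (not any particular library's attributes): fine-tune the embeddings while keeping everything else fixed.

```python
import torch.nn as nn

# Hypothetical translation model: the names "embedding", "rnn", "out"
# are illustrative only.
class Translator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(5000, 128)
        self.rnn = nn.GRU(128, 256)
        self.out = nn.Linear(256, 5000)

model = Translator()

# Freeze everything except the embedding layer.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("embedding")

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # → ['embedding.weight']
```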
But today I happened to check which parameters are updated during gradient descent (i.e. which are frozen and which are not) in the Fast.ai language model learner with a pretrained AWD-LSTM model. Perhaps this helps you?
(Disclaimer: I wrote the code just a few minutes ago, and I believe I inspect the correct values to determine whether a layer is frozen: basically, I check the `requires_grad` property of each parameter.)
The structure is roughly:
- encoder [Not frozen]
- encoder_dp [Not frozen]
- rnns [Frozen]
- input_dp [Frozen]
- hidden_dps [Frozen]
- LinearDecoder [Not frozen]
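Here is roughly the kind of check I mean, using a plain PyTorch stand-in for the AWD-LSTM learner (the module names are illustrative, not fastai's actual attributes): a parameter group counts as frozen when none of its parameters require gradients.

```python
import torch.nn as nn

# Minimal stand-in for the AWD-LSTM language model:
# embedding encoder, stacked LSTMs, linear decoder.
model = nn.ModuleDict({
    "encoder": nn.Embedding(1000, 64),
    "rnns": nn.LSTM(64, 64, num_layers=2),
    "decoder": nn.Linear(64, 1000),
})

# Reproduce the defaults described above: only the RNNs are frozen.
for p in model["rnns"].parameters():
    p.requires_grad = False

# Report frozen / not frozen per top-level module via requires_grad.
for name, module in model.items():
    flags = {p.requires_grad for p in module.parameters()}
    status = "Not frozen" if flags == {True} else "Frozen"
    print(f"- {name} [{status}]")
```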
There is a post here with a discussion by Jeremy Howard on how to tell which layers are frozen, the terminology, and the order in which layers are frozen. (Note that the functions mentioned there seem to have changed, so I looked into the PyTorch docs to get the necessary information.) My observations seem to match the comments made there.
One caveat: freezing a layer does not block gradients from reaching earlier layers. Setting `requires_grad = False` on a layer's parameters only stops those parameters from being updated; gradients with respect to the layer's inputs are still computed during backpropagation. So you can unfreeze the embedding layer even while downstream layers remain frozen.
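This is easy to verify directly in plain PyTorch: a frozen linear layer receives no parameter gradients, yet the gradient still flows through it to a trainable embedding behind it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Trainable embedding followed by a frozen linear layer.
emb = nn.Embedding(10, 4)
frozen = nn.Linear(4, 2)
for p in frozen.parameters():
    p.requires_grad = False

tokens = torch.tensor([1, 2, 3])
loss = frozen(emb(tokens)).sum()
loss.backward()

# The frozen layer's weights received no gradient...
assert frozen.weight.grad is None
# ...but the gradient still flowed through it to the embedding.
assert emb.weight.grad is not None
```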