I have just read this blog post/paper on how word representations evolve in transformers depending on the training objective.
I’ll paste the short conclusions here. They hint that language modeling might not be the best pre-training task, if I got that right. Could this mean that implementing a masked language modeling objective might improve transfer learning performance for NLP in Fastai?
From the paper (LM = language modeling, MT = machine translation, MLM = masked language modeling):
Now, summarizing all our experiments, we can make some general statements about the evolution of representations. Namely,
- with the LM objective, as you go from bottom to top layers, information about the past gets lost and predictions about the future get formed
- for MLMs, representations initially acquire information about the context around the token, partially forgetting the token identity and producing a more generalized token representation; the token identity then gets recreated at the top layer
- for MT, though representations get refined with context, less processing is happening and most information about the word type does not get lost
This provides us with a hypothesis for why the MLM objective may be preferable in the pretraining context to LM. LMs may not be the best choice, because neither information about the current token and its past nor future is represented well: the former since this information gets discarded, the latter since the model does not have access to the future.
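To make the LM vs. MLM difference concrete, here is a minimal sketch in plain Python (not Fastai code, and not the paper's implementation) of what the model sees and what it has to predict under each objective. The function names `lm_examples` and `mlm_example`, the `[MASK]` string, and the 15% masking rate are assumptions for illustration only; real MLM setups such as BERT use a more elaborate corruption scheme.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for this sketch


def lm_examples(tokens):
    """Left-to-right LM: each target is predicted from the preceding tokens only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mlm_example(tokens, mask_prob=0.15, seed=0):
    """Simplified MLM: hide some tokens and predict them from both sides.

    Returns the corrupted sequence and a dict {position: original token}.
    """
    rng = random.Random(seed)
    positions = [i for i in range(len(tokens)) if rng.random() < mask_prob]
    if not positions:
        # Make sure at least one token is masked so there is something to predict.
        positions = [rng.randrange(len(tokens))]
    corrupted = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return corrupted, targets


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()

    # LM: the context never includes the target or anything after it.
    for context, target in lm_examples(sentence):
        print(context, "->", target)

    # MLM: the context includes tokens on both sides of each masked position.
    corrupted, targets = mlm_example(sentence)
    print(corrupted, "->", targets)
```

In the LM case every context stops just before the target, so the model never has access to the future. In the MLM case the masked position is surrounded by visible tokens on both sides, which matches the paper's point that the representation first absorbs context and then recreates the token identity at the top.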