I just tried some classification on a dataset with a relatively small corpus. AFAIK, the language model pretrained on wikitext has a vocab size of 60k, and after intersecting that with my corpus, only ~4,000 words made the cut. The classifier is trained on tweets about airlines, and I feel the small vocabulary makes sentiment analysis a relatively harder task.

Does it make sense to transfer more of the embeddings pretrained on wikitext into my model? I realise those words are not present in my target corpus and will never be fine-tuned, but at least during inference, instead of being treated as 'unk', those words would mean something to the model. I wonder if this would hurt performance? It obviously won't do much for the validation accuracy of the classification model, since the 'extra' wikitext words have nothing to do with the test set. Perhaps I will try this out if I can figure out how to tweak the API a little bit.
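For what it's worth, here is a minimal sketch of the idea, independent of any particular library's API (the function name `extend_vocab` and the mean-vector fallback are my own choices, not anything from the pretrained model's codebase): keep the corpus vocab first, append every remaining pretrained token, and copy over the pretrained embedding rows so those words don't collapse to 'unk' at inference time.

```python
import numpy as np

def extend_vocab(corpus_vocab, pretrained_vocab, pretrained_emb):
    """Build a combined vocab and embedding matrix.

    Corpus tokens come first (so existing corpus indices stay valid),
    followed by all pretrained tokens not in the corpus. Tokens found
    in the pretrained vocab get their pretrained embedding row; tokens
    unique to the corpus fall back to the mean pretrained vector.
    """
    corpus_set = set(corpus_vocab)
    extra = [w for w in pretrained_vocab if w not in corpus_set]
    vocab = list(corpus_vocab) + extra

    pre_idx = {w: i for i, w in enumerate(pretrained_vocab)}
    mean_vec = pretrained_emb.mean(axis=0)  # fallback for corpus-only words
    emb = np.stack([
        pretrained_emb[pre_idx[w]] if w in pre_idx else mean_vec
        for w in vocab
    ])
    return vocab, emb
```

The one design choice worth noting: putting corpus tokens first means the classifier's existing token ids are unchanged, and the extra rows are pure additions that only matter when an out-of-corpus word shows up at inference.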