How to deal with words not in embedding? Question regarding seq2seq model training

Hi, I am currently trying to implement a seq2seq model in PyTorch and have run into a potential issue. So far I have trained my own word2vec model to generate the word embeddings I plan to use in the seq2seq model. My concern is how to handle a word that the embedding does not know about. Right off the bat I am worried about the special tokens ‘beginning of sequence’, ‘end of sequence’ and ‘pad’, since those were not part of my word2vec training. Should they have been? Would it make sense to train these special tokens from scratch during seq2seq training? Thanks!
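For context, here is roughly what I had in mind: copy the word2vec vectors into the embedding matrix, add extra rows for the special tokens, and let everything keep training during the seq2seq run. This is just a sketch with placeholder names, not my actual code:

import numpy as np
import torch
import torch.nn as nn

emb_dim = 300
w2v_vocab = ["the", "cat", "sat"]                       # words my word2vec model knows (placeholder)
w2v = {w: np.random.randn(emb_dim).astype(np.float32)   # stand-in for the real word2vec vectors
       for w in w2v_vocab}
specials = ["<pad>", "<bos>", "<eos>"]                  # tokens word2vec never saw

itos = specials + w2v_vocab
stoi = {t: i for i, t in enumerate(itos)}

weights = np.zeros((len(itos), emb_dim), dtype=np.float32)
for i, tok in enumerate(itos):
    if tok in w2v:
        weights[i] = w2v[tok]                           # copy the pretrained vector
    else:
        weights[i] = np.random.normal(scale=0.1, size=emb_dim)  # special tokens start from scratch

# freeze=False so every row, special tokens included, keeps updating during seq2seq training
emb = nn.Embedding.from_pretrained(torch.from_numpy(weights),
                                   freeze=False,
                                   padding_idx=stoi["<pad>"])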


From what I remember Rachel saying in the NLP course, tokens that haven't been seen before are initialised with the mean of the other tokens' embeddings, and as you train they get updated. She shows an example of two such unseen tokens being initialised this way and then diverging after training.
Check out NLP video 8 and the 5-nn-imdb.ipynb notebook.
“30-something” and “linklater” don’t have pretrained embeddings, so they are both given the average of all the embeddings:

np.allclose(enc.weight[vocab.stoi["30-something"], :], 
            enc.weight[vocab.stoi["linklater"], :])

True
After training, if you run the same check again it returns False, because the two embeddings have been updated independently.
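If you want to do the same thing with your own word2vec vectors in plain PyTorch, a minimal sketch might look like this (all names here are made up, not taken from the notebook):

import numpy as np
import torch
import torch.nn as nn

emb_dim = 100
w2v = {"movie": np.random.randn(emb_dim).astype(np.float32),   # stand-in for pretrained word2vec vectors
       "good": np.random.randn(emb_dim).astype(np.float32)}
itos = ["<pad>", "<bos>", "<eos>", "movie", "good"]            # seq2seq vocab, special tokens included
stoi = {t: i for i, t in enumerate(itos)}

# tokens word2vec has never seen get the mean of all pretrained vectors
mean_vec = np.mean(list(w2v.values()), axis=0)
weights = np.stack([w2v.get(tok, mean_vec) for tok in itos]).astype(np.float32)

enc = nn.Embedding.from_pretrained(torch.from_numpy(weights),
                                   freeze=False,
                                   padding_idx=stoi["<pad>"])

# before training, the unseen tokens all share the same (mean) embedding...
print(np.allclose(enc.weight[stoi["<bos>"]].detach().numpy(),
                  enc.weight[stoi["<eos>"]].detach().numpy()))  # True
# ...and because freeze=False, training will pull them apart afterwards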
Hope that helps.