As I understand it, a normal language (word) model takes a (word) embedding vector as input but has a one-hot vector as its target output. The loss is then the cross-entropy between the actual and predicted distributions.
But if the next word is “marvellous” and the model predicts “wonderful”, shouldn’t that incur a lower loss than predicting “badger”? I understand it’s a bit more subtle than this, since the model predicts a probability for every word in its vocabulary, but still: is there a way to score it higher if it predicts the wrong word but keeps the right meaning?
I was wondering if anyone has tried outputting an embedding vector instead of a one-hot vector? Then you would need to find the nearest neighbour to the output vector to get the predicted word.
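To make the idea concrete, here’s a minimal numpy sketch of that decoding step. The tiny vocabulary and embedding table are invented for illustration (a real model would learn them); the point is just that the nearest-neighbour lookup naturally treats “wonderful” as a near-miss for “marvellous” while “badger” stays far away.

```python
import numpy as np

# Toy embedding table (vocab x dim) -- made-up values for illustration only.
vocab = ["marvellous", "wonderful", "badger"]
embeddings = np.array([
    [0.90, 0.80, 0.10],  # marvellous
    [0.85, 0.75, 0.15],  # wonderful
    [0.10, 0.20, 0.90],  # badger
])

def nearest_word(predicted_vec):
    """Map a predicted embedding vector to the closest vocabulary word
    by cosine similarity."""
    sims = embeddings @ predicted_vec / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(predicted_vec)
    )
    return vocab[int(np.argmax(sims))]

# A prediction that lands between "marvellous" and "wonderful" decodes
# to one of them, never to "badger" -- and a distance-based loss would
# stay small either way.
pred = np.array([0.8, 0.7, 0.2])
```

The same cosine similarity used for decoding could serve as the training loss (e.g. `1 - sim` against the target word’s vector), which is what makes near-synonyms cheap mistakes.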
I made a comment along the same lines in the advanced thread for the lesson. Here are my thoughts on the idea since then:
The main issue I see is that the word vectors are created and updated as part of training the model. That raises the question: what is your ground truth? Between the start and end of training, the word vectors for “marvelous” and “wonderful” will change, which means the similarity between the two vectors will change too.
Do you have a separate set of static word vectors that you use to train the model? If so, is the performance of your model tied to the quality of those ground-truth word vectors?
Maybe you could have a hybrid loss function where your cross-entropy loss is increased or decreased by some sort of similarity score. You’d need to design it so that the model does not update the ‘ground truth’ word vectors in the process.
Yes it’s a good idea, and @honnibal recently tried it, although he hasn’t really done enough training of the model to test it fully.
I wonder if the weight tying used in seq2seq models, as described in this post, could be used in this situation. It might make it possible to tell that “marvelous” is closer to “wonderful” than to “badger”.
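A sketch of why tying helps, assuming the standard setup where the output projection reuses the embedding matrix: the logits are dot products between the hidden state and every word’s embedding, so words with nearby embeddings get nearby logits automatically. The matrix values below are toy numbers for illustration.

```python
import numpy as np

# Tied embedding / output matrix (vocab x dim) -- toy values.
# Rows: marvellous, wonderful, badger.
E = np.array([
    [0.90, 0.80, 0.10],
    [0.85, 0.75, 0.15],
    [0.10, 0.20, 0.90],
])

hidden = np.array([0.8, 0.7, 0.2])  # pretend decoder hidden state

# With weight tying there is no separate softmax matrix:
# the output layer *is* E, so logits are similarities to each embedding.
logits = E @ hidden
# "marvellous" and "wonderful" score close together; "badger" much lower.
```

So even though the loss is still cross-entropy against a one-hot target, the tied geometry means mass leaks to synonyms, which softens the penalty for meaning-preserving mistakes.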
And oh look, here’s a paper on the subject.