As I understand it, a standard language (word) model takes a (word) embedding vector as input but is trained against a 1-hot vector as output. The loss is then the cross-entropy between the actual and predicted distributions.
But if the next word is “marvellous” and the model predicts “wonderful”, shouldn’t that get a lower loss than if it predicts “badger”? I understand it’s a bit more subtle than this, since the model predicts a probability for every word in its vocabulary, but still: is there a way to score it higher when it predicts the wrong word but keeps the right meaning?
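To make the problem concrete, here is a minimal sketch (the vocabulary and probability values are made up for illustration) showing that cross-entropy only looks at the probability assigned to the true word, so it cannot distinguish a near-synonym from an unrelated word:

```python
import numpy as np

# Hypothetical 4-word vocabulary; indices are made up for illustration.
vocab = ["marvellous", "wonderful", "badger", "the"]
target = 0  # the next word really is "marvellous"

# Two predicted distributions that assign "marvellous" the same probability,
# but put the remaining mass on a synonym vs an unrelated word.
p_synonym   = np.array([0.4, 0.5, 0.05, 0.05])  # extra mass on "wonderful"
p_unrelated = np.array([0.4, 0.05, 0.5, 0.05])  # extra mass on "badger"

def cross_entropy(p, target_idx):
    # Standard loss: minus the log-probability of the true word only.
    return -np.log(p[target_idx])

# Both losses are identical (-log 0.4 ≈ 0.916): the loss ignores where
# the rest of the probability mass goes.
print(cross_entropy(p_synonym, target))
print(cross_entropy(p_unrelated, target))
```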
I was wondering if anyone has tried outputting an embedding vector instead of a 1-hot vector? Then you would need to find the nearest neighbour of the output vector in the embedding table to get the predicted word.
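The decoding step I have in mind could be sketched like this (a toy embedding table with random values, assuming cosine similarity for the nearest-neighbour lookup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding table: 5 words, 8-dimensional vectors (random stand-ins).
vocab = ["marvellous", "wonderful", "badger", "the", "cat"]
E = rng.normal(size=(len(vocab), 8))
E /= np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalise each row

def nearest_word(predicted_vec):
    # Decode a predicted embedding by cosine similarity against the table.
    v = predicted_vec / np.linalg.norm(predicted_vec)
    return vocab[int(np.argmax(E @ v))]

# Pretend the model output a slightly noisy version of one embedding.
pred = E[0] + 0.1 * rng.normal(size=8)
print(nearest_word(pred))
```

The training loss would then be something like cosine distance or MSE between the predicted and target embeddings, rather than cross-entropy over the vocabulary.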