Word2vec/skip-gram architecture/loss question

I’m looking at using a word2vec/skip-gram model to learn embeddings for products that appear in a sequence, similar to how words appear in a sentence.

Here is the paper I am using as a reference:

Here is an example of airbnb using this type of model for their listings:

I am having trouble understanding the idea behind the loss function (eq. 2 and 3 in the first link; here is a code example for the positive case: https://github.com/theeluwin/pytorch-sgns/blob/master/model.py#L73). The idea is to “push”/update the embedding vector of the input word closer to the vectors of the output words and further away from the vectors of the negatively sampled words.
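To make sure I’m reading the objective correctly, here is a minimal sketch of the negative-sampling loss as I understand it (the function and variable names are mine, not from the linked repo):

```python
import torch
import torch.nn.functional as F

def sgns_loss(center, context, negatives):
    """Skip-gram negative-sampling loss for one (center, context) pair.

    center:    (dim,)   embedding of the input word/product
    context:   (dim,)   embedding of the observed output word/product
    negatives: (k, dim) embeddings of k negatively sampled words/products
    """
    # Positive term: pull the center vector towards the observed context vector.
    pos_score = torch.dot(center, context)       # larger dot product -> higher probability
    pos_loss = -F.logsigmoid(pos_score)          # -log sigma(v_i . v_o)

    # Negative term: push the center vector away from the sampled noise vectors.
    neg_scores = negatives @ center              # (k,) dot products
    neg_loss = -F.logsigmoid(-neg_scores).sum()  # -sum_k log sigma(-v_i . v_k)

    return pos_loss + neg_loss
```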

From my understanding, the probability for a positive pair is σ(v_i · v_o) = 1/(1 + e^(-(v_i · v_o))), so the loss -log σ(v_i · v_o) gets smaller when the dot product of the input and output vectors is larger. What I don’t get is why using the dot product to compare the similarity of two vectors is the best approach. Does anyone have a good explanation?
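For reference, this is the behaviour I mean (toy 2-d vectors I made up, just to illustrate that a larger dot product gives a higher sigmoid probability):

```python
import torch

v_i = torch.tensor([1.0, 0.0])       # input embedding
v_close = torch.tensor([0.9, 0.1])   # context vector pointing the same way
v_far = torch.tensor([-0.9, 0.1])    # context vector pointing the opposite way

for v_o in (v_close, v_far):
    score = torch.dot(v_i, v_o)
    prob = torch.sigmoid(score)      # sigma(v_i . v_o) = 1 / (1 + e^(-v_i . v_o))
    print(score.item(), prob.item()) # the aligned pair gets the higher probability
```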