How are text embeddings generated using neural networks?
This is how I think it works, but I'm not sure how we would get the training data: for a given word, if we want a 50-dimensional representative vector, the final layer would have 50 nodes. Is this correct?
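To make my mental picture concrete, here is a rough sketch of what I imagine (PyTorch, with a made-up vocabulary size and an arbitrary hidden layer, just for illustration). The idea I have is that the 50-node final layer's output would itself be the embedding:

```python
import torch
import torch.nn as nn

# Sketch of my current understanding (sizes are made up):
# a one-hot word goes in, and the 50-node final layer's output
# is supposed to be the word's embedding.
vocab_size = 10_000
embed_dim = 50

model = nn.Sequential(
    nn.Linear(vocab_size, 128),   # hidden layer, size chosen arbitrarily
    nn.ReLU(),
    nn.Linear(128, embed_dim),    # final layer with 50 nodes -> the "embedding"?
)

one_hot = torch.zeros(vocab_size)
one_hot[42] = 1.0                 # index 42 stands for some word
embedding = model(one_hot)        # 50-dimensional vector
print(embedding.shape)            # torch.Size([50])
```

Is this the right picture, and if so, what would the target outputs be during training?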
Moreover, in the paper "DeViSE: A Deep Visual-Semantic Embedding Model", the language model pre-training section says: "Our skip-gram model used a hierarchical softmax layer for predicting adjacent terms and was trained using a 20-word window with a single pass through the corpus." How would predicting the adjacent words give us these embeddings? Are the embeddings we get the corresponding weights?
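In other words, is it something like the following simplified skip-gram sketch? (This is my guess, not the paper's setup: I use a plain softmax instead of their hierarchical softmax, a made-up vocabulary size, and a single center/context pair instead of a 20-word window.)

```python
import torch
import torch.nn as nn

# Simplified skip-gram sketch (plain softmax, made-up sizes).
# The question: after training, is the embedding for word w just
# the w-th row of in_embed.weight?
vocab_size = 10_000
embed_dim = 50

in_embed = nn.Embedding(vocab_size, embed_dim)   # center-word weights
out_proj = nn.Linear(embed_dim, vocab_size)      # predicts context words

center = torch.tensor([42])                      # some center word id
context = torch.tensor([7])                      # an adjacent word id

logits = out_proj(in_embed(center))              # scores over the vocabulary
loss = nn.functional.cross_entropy(logits, context)
loss.backward()                                  # repeated over the corpus during training

word_vector = in_embed.weight[42]                # the 50-d embedding for word 42?
```

So is the embedding just this learned weight matrix, even though the network was only ever trained to predict adjacent words?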