Let’s say we are creating word embeddings to help us tackle an NLP task. The words in an embedding are usually given arbitrary keys (WordIDs), irrespective of their parts of speech. Would the network converge faster if all similar words with the same part of speech were grouped together?
The reason for asking this question stems from the fact that we are basically doing a lot of vector and matrix operations which (crudely speaking) try to maximize the probability of a certain word coming next. Add to that the fact that we mostly keep the frequent words in the embedding vocabulary and drop the rare ones. This way we could help the network find the correct range of WordIDs to look into when predicting the next word.
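To make the idea concrete, here is a minimal sketch contrasting the usual frequency-based WordID assignment with the POS-grouped assignment I have in mind. The toy corpus and the assumption of a single POS tag per word are both hypothetical:

```python
# Sketch only: contrasts frequency-based WordIDs with POS-grouped WordIDs.
# The corpus and its tags are made up for illustration.
from collections import Counter

corpus = [("the", "DET"), ("cat", "NOUN"), ("swims", "VERB"),
          ("the", "DET"), ("dog", "NOUN"), ("swam", "VERB")]

# Usual approach: IDs follow descending frequency, POS is ignored.
freq = Counter(word for word, _ in corpus)
freq_vocab = {word: idx for idx, (word, _) in enumerate(freq.most_common())}

# Proposed approach: sort by (POS, descending frequency) so words sharing
# a part of speech occupy a contiguous block of WordIDs.
pos_of = {word: pos for word, pos in corpus}   # naive: one POS per word
pos_sorted = sorted(freq, key=lambda w: (pos_of[w], -freq[w]))
pos_vocab = {word: idx for idx, word in enumerate(pos_sorted)}

print(freq_vocab)  # IDs ordered purely by frequency
print(pos_vocab)   # DET block, then NOUN block, then VERB block
```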
I understand that a given word can carry multiple part-of-speech tags (across different texts), but I would suggest having multiple WordIDs for such words to capture the inherent grammatical constraints of natural language. Even in most work on Neural Machine Translation, word forms that share the same origin/root get different WordIDs anyway (e.g. word_origin = “swim” and verb_forms = [“swim”, “swimming”, “swam”, “swims”]; each verb form gets a unique WordID in this case).
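And here is a hedged sketch of the multiple-WordIDs idea: fuse the surface form with its POS tag into a single vocabulary key, so that “swim” as a verb and “swim” as a noun receive distinct embeddings. The tagged sentences below are made up for illustration rather than produced by a real tagger:

```python
# Sketch: one WordID per (word, POS) pair instead of per surface form.
tagged_sentences = [
    [("I", "PRON"), ("swim", "VERB"), ("daily", "ADV")],
    [("a", "DET"), ("quick", "ADJ"), ("swim", "NOUN")],
]

vocab = {}

def word_id(word, pos):
    """Return a stable ID for the (word, POS) pair, adding it if unseen."""
    key = f"{word}|{pos}"
    if key not in vocab:
        vocab[key] = len(vocab)
    return vocab[key]

ids = [[word_id(w, p) for w, p in sent] for sent in tagged_sentences]
print(vocab)  # 'swim|VERB' and 'swim|NOUN' get different WordIDs
```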
Any advice/intuition/evidence on whether this has ever been (or could be) observed? If so, could it be considered a best practice?
My guess is that it should help the network converge faster but will increase the size of the embedding dictionary. Looking for a constructive discussion on this.
Additionally, should we sort the dictionary alphabetically?
Please let me know if this makes sense or if I should rephrase any parts.