Let’s say we are creating word embeddings to help us tackle an NLP task. The words in an embedding are usually given arbitrary keys (WordIDs), irrespective of their parts of speech. Would the network converge faster if all similar words with the same part of speech were grouped together?
The reason for asking this question stems from the fact that we are basically doing a lot of vector and matrix operations which (crudely speaking) try to maximize the probability of a certain word coming next. Add to that the fact that we mostly keep the frequent words in the embedding vocabulary and drop the rare ones. This way we could help the network find the correct range of WordIDs to look into when predicting the next word.
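To make the idea concrete, here is a minimal sketch contrasting the usual frequency-based WordID assignment with the POS-grouped assignment I have in mind. The toy corpus and the assumption of a single POS tag per word are both hypothetical:

```python
# Sketch only: contrasts frequency-based WordIDs with POS-grouped WordIDs.
# The corpus and its tags are made up for illustration.
from collections import Counter

corpus = [("the", "DET"), ("cat", "NOUN"), ("swims", "VERB"),
          ("the", "DET"), ("dog", "NOUN"), ("swam", "VERB")]

# Usual approach: IDs follow descending frequency, POS is ignored.
freq = Counter(word for word, _ in corpus)
freq_vocab = {word: idx for idx, (word, _) in enumerate(freq.most_common())}

# Proposed approach: sort by (POS, descending frequency) so words sharing
# a part of speech occupy a contiguous block of WordIDs.
pos_of = {word: pos for word, pos in corpus}   # naive: one POS per word
pos_sorted = sorted(freq, key=lambda w: (pos_of[w], -freq[w]))
pos_vocab = {word: idx for idx, word in enumerate(pos_sorted)}

print(freq_vocab)  # IDs ordered purely by frequency
print(pos_vocab)   # DET block, then NOUN block, then VERB block
```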
I understand that a given word can carry multiple part-of-speech tags (across different texts), but I would suggest having multiple WordIDs for such words to capture the inherent grammatical constraints of natural language. Even in most work on Neural Machine Translation, word forms that share the same origin/root get different WordIDs anyway (e.g. word_origin = “swim” and verb_forms = [“swim”, “swimming”, “swam”, “swims”]; each verb form gets a unique WordID in this case).
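And here is a hedged sketch of the multiple-WordIDs idea: fuse the surface form with its POS tag into a single vocabulary key, so that “swim” as a verb and “swim” as a noun receive distinct embeddings. The tagged sentences below are made up for illustration rather than produced by a real tagger:

```python
# Sketch: one WordID per (word, POS) pair instead of per surface form.
tagged_sentences = [
    [("I", "PRON"), ("swim", "VERB"), ("daily", "ADV")],
    [("a", "DET"), ("quick", "ADJ"), ("swim", "NOUN")],
]

vocab = {}

def word_id(word, pos):
    """Return a stable ID for the (word, POS) pair, adding it if unseen."""
    key = f"{word}|{pos}"
    if key not in vocab:
        vocab[key] = len(vocab)
    return vocab[key]

ids = [[word_id(w, p) for w, p in sent] for sent in tagged_sentences]
print(vocab)  # 'swim|VERB' and 'swim|NOUN' get different WordIDs
```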
Any advice/intuition/evidence on whether this has ever been (or could be) observed? If so, could it be considered a best practice?
My guess is that it should help the network converge faster but will increase the size of the embedding dictionary. Looking for a constructive discussion on this.
Additionally, should we sort the dictionary alphabetically?
Please let me know if this makes sense or if I should rephrase any parts.