I’m having a little trouble wrapping my head around embeddings, and particularly the difference between the day-of-week example described in lesson 4 vs things like word embeddings. (Maybe this is explained more later?)
Here’s my confusion. Let’s suppose we have a toy model where we’re predicting something (rainfall, say) from just one feature: day of week. And let’s assume it’s a simple categorical representation.
In old-school one-hot encoding, we’d have 7 columns in our DataFrame (or, I guess, six, at least with linear regression and such, but I don’t know how that stuff translates to neural nets, so let’s say 7), and exactly 1 of those columns is 1 for every row (each observation appears on exactly 1 day of the week).
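To make sure I’m describing this right, here’s the picture in my head as toy numpy code (the observations are made up):

```python
import numpy as np

# 7 one-hot columns, one per day of the week
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
observations = ["Wed", "Mon", "Wed"]  # day-of-week feature for 3 rows

one_hot = np.zeros((len(observations), len(days)), dtype=int)
for row, day in enumerate(observations):
    one_hot[row, days.index(day)] = 1

# each row has exactly one 1, since each observation falls on exactly one day
```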
Then, when we train an embedding (let’s assume it’s 4 columns), we end up with 4 columns in our new DataFrame, and each of those columns holds a float, but those 4 columns still represent a single day of the week, because there’s still only one day per observation. So we can directly replace our 7 one-hot-encoding columns with 4 embedding columns.
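In other words (if I’ve got this right), looking up a row of the embedding matrix is the same as multiplying the one-hot vector by it — the 7 one-hot columns collapse into 4 float columns. Sketching that with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=(7, 4))  # 7 days -> 4-dim vectors (made-up values)

day_index = 2  # say, Wednesday
one_hot = np.zeros(7)
one_hot[day_index] = 1

# multiplying a one-hot row by the embedding matrix just selects one row of it
via_matmul = one_hot @ embedding
via_lookup = embedding[day_index]
# via_matmul and via_lookup are the same 4 floats: the embedding for Wednesday
```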
That far, I understand.
But now suppose we’re trying to train word embeddings. Suppose we have a bunch of sentences and a vocabulary of 12 words: “the,” “is,” “cat,” “dog,” “forward,” “back,” “comes,” “goes,” “in,” “out,” “hat,” “gloves.” With traditional one-hot encoding I’d have 12 columns, each column holding a 1 or a 0 depending on whether the word appeared in the sentence, and I’d predict on that. (Let’s assume I’m throwing out data about how many times each word appears.)
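So the one-hot (really bag-of-words presence) representation I have in mind is something like:

```python
vocab = ["the", "is", "cat", "dog", "forward", "back", "comes",
         "goes", "in", "out", "hat", "gloves"]
sentence = "the cat in the hat comes back".split()

# 12 columns: 1 if the word appears anywhere in the sentence, 0 otherwise
# (counts are discarded -- "the" appears twice but still just gets a 1)
bow = [1 if word in sentence else 0 for word in vocab]
```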
But now, suppose that we train a 6-dimensional embedding matrix. And suppose that I have the sentence “the cat in the hat comes back.”
How do I generate my DataFrame? I’ve got a different six-dimensional vector for each word in my observation. I assume I can’t just concatenate them, ending up with a 42-dimension row (the 6 dimensions for each of the 7 words). For one thing, that would immediately blow up because different sentences would give you different numbers of columns per row (or is that ok, with RNNs or something?). Or do we take a linear combination of them? Like, is a document represented by the mean of the vectors for each word? The sum? The product? All of these seem intuitively shaky to me… you could imagine very different sentences ending up with similar mean vectors, for example. But maybe that’s ok, and it turns out that more or less corresponds to similar meanings?
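Just to pin down the two options I’m asking about, here’s a sketch (random embedding values, just for shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["the", "is", "cat", "dog", "forward", "back", "comes",
         "goes", "in", "out", "hat", "gloves"]
embedding = rng.normal(size=(len(vocab), 6))  # 12 words -> 6-dim vectors

sentence = "the cat in the hat comes back".split()
vectors = np.stack([embedding[vocab.index(word)] for word in sentence])

# option 1: concatenate -- 7 words x 6 dims = 42 numbers,
# but the length changes with every sentence
concatenated = vectors.reshape(-1)

# option 2: mean-pool -- always 6 numbers, regardless of sentence length
mean_pooled = vectors.mean(axis=0)
```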
Love any insight on this… Thanks!