I’m having a little trouble wrapping my head around embeddings, and particularly the difference between the day-of-week example described in lesson 4 vs things like word embeddings. (Maybe this is explained more later?)
Here’s my confusion. Let’s suppose we have a toy model where we’re predicting something (rainfall, say) from just one feature: day of week. And let’s assume it’s a simple categorical representation.
In old-school one-hot encoding, we’d have 7 columns in our DataFrame (or, I guess, six, at least with linear regression and such, but I don’t know how that stuff translates to neural nets, so let’s say 7), and exactly 1 of those columns is 1 for every row (each observation appears on exactly 1 day of the week).
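To make sure I’m describing this right, here’s the picture in my head as toy numpy code (the observations are made up):

```python
import numpy as np

# 7 one-hot columns, one per day of the week
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
observations = ["Wed", "Mon", "Wed"]  # day-of-week feature for 3 rows

one_hot = np.zeros((len(observations), len(days)), dtype=int)
for row, day in enumerate(observations):
    one_hot[row, days.index(day)] = 1

# each row has exactly one 1, since each observation falls on exactly one day
```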
Then, when we train an embedding (let’s assume it’s 4 columns), we end up with 4 columns in our new DataFrame, and each of those columns holds a float, but those 4 columns still represent a single day of the week, because there’s still only one day per observation. So we can directly replace our 7 one-hot-encoding columns with 4 embedding columns.
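In other words (if I’ve got this right), looking up a row of the embedding matrix is the same as multiplying the one-hot vector by it — the 7 one-hot columns collapse into 4 float columns. Sketching that with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=(7, 4))  # 7 days -> 4-dim vectors (made-up values)

day_index = 2  # say, Wednesday
one_hot = np.zeros(7)
one_hot[day_index] = 1

# multiplying a one-hot row by the embedding matrix just selects one row of it
via_matmul = one_hot @ embedding
via_lookup = embedding[day_index]
# via_matmul and via_lookup are the same 4 floats: the embedding for Wednesday
```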
That far, I understand.
But now suppose we’re trying to train word embeddings. Suppose we have a bunch of sentences and a vocabulary of 12 words: “the,” “is,” “cat,” “dog,” “forward,” “back,” “comes,” “goes,” “in,” “out,” “hat,” “gloves.” With traditional one-hot encoding I’d have 12 columns, each column holding a 1 or a 0 depending on whether the word appeared in the sentence, and I’d predict on that. (Let’s assume I’m throwing out data about how many times each word appears.)
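So the one-hot (really bag-of-words presence) representation I have in mind is something like:

```python
vocab = ["the", "is", "cat", "dog", "forward", "back", "comes",
         "goes", "in", "out", "hat", "gloves"]
sentence = "the cat in the hat comes back".split()

# 12 columns: 1 if the word appears anywhere in the sentence, 0 otherwise
# (counts are discarded -- "the" appears twice but still just gets a 1)
bow = [1 if word in sentence else 0 for word in vocab]
```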
But now, suppose that we train a 6-dimensional embedding matrix. And suppose that I have the sentence “the cat in the hat comes back.”
How do I generate my DataFrame? I’ve got a different six-dimensional vector for each word in my observation. I assume I can’t just concatenate them, ending up with a 42-dimension row (the 6 dimensions for each of the 7 words). For one thing, that would immediately blow up because different sentences would give you different numbers of columns per row (or is that ok, with RNNs or something?). Or do we take a linear combination of them? Like, is a document represented by the mean of the vectors for each word? The sum? The product? All of these seem intuitively shaky to me… you could imagine very different sentences ending up with similar mean vectors, for example. But maybe that’s ok, and it turns out that more or less corresponds to similar meanings?
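Just to pin down the two options I’m asking about, here’s a sketch (random embedding values, just for shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["the", "is", "cat", "dog", "forward", "back", "comes",
         "goes", "in", "out", "hat", "gloves"]
embedding = rng.normal(size=(len(vocab), 6))  # 12 words -> 6-dim vectors

sentence = "the cat in the hat comes back".split()
vectors = np.stack([embedding[vocab.index(word)] for word in sentence])

# option 1: concatenate -- 7 words x 6 dims = 42 numbers,
# but the length changes with every sentence
concatenated = vectors.reshape(-1)

# option 2: mean-pool -- always 6 numbers, regardless of sentence length
mean_pooled = vectors.mean(axis=0)
```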
Love any insight on this… Thanks!