Embedding Matrix

umar · June 21, 2018, 11:38am

Can someone tell me whether using an embedding matrix would be better than one-hot encoding for categorical variables in machine learning? Also, can we use an embedding matrix for binned variables?

tenoke · June 21, 2018, 3:05pm

Embeddings are usually better, especially when there is something to learn about each category (e.g. A tends to be colder and smaller, B hotter and smaller, C leans colder and larger but is sometimes hotter, etc.). One-hot encoding is good enough (and possibly better, being simpler) when there is nothing to learn about the categories other than that they are distinct (e.g. Rock vs Paper vs Scissors in RPS).

cqfd · June 21, 2018, 5:05pm

Embeddings are actually equivalent to one-hot encodings followed by a linear layer (with no bias term); their advantage is just that they’re more efficient.

Each embedding gets stored as a row (or maybe a column, doesn’t really matter) in an embedding matrix, and when you go grab the embedding for input i, you just look up row i in the matrix. This is mathematically equivalent to

[0,\ldots,1,\ldots,0] \cdot E

where E is the embedding matrix, and the row vector on the left is a one-hot encoding with the slot i equal to one. (That is, you can check that this product returns row i of E.)

The advantage of using an embedding is that you don’t actually have to do that matrix multiplication: you can just grab the i-th row.
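A quick way to convince yourself of the equivalence is to check it numerically. The sketch below uses NumPy (an assumption, since no framework is named here) with a made-up 8x4 embedding matrix and an arbitrary index; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical sizes: 8 categories, 4-dimensional embeddings.
num_categories, embedding_dim = 8, 4

rng = np.random.default_rng(0)
# Embedding matrix E: one row per category.
E = rng.normal(size=(num_categories, embedding_dim))

i = 3  # arbitrary category index

# One-hot row vector with slot i equal to one.
one_hot = np.zeros(num_categories)
one_hot[i] = 1.0

via_matmul = one_hot @ E  # full vector-matrix product
via_lookup = E[i]         # plain row lookup, no multiplication

# Both routes give row i of E.
assert np.allclose(via_matmul, via_lookup)
```

The product touches every entry of E, while the lookup reads a single row, which is where the efficiency gain comes from.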

gevezex · November 1, 2018, 7:57am

How would this be represented in a network architecture plot?

I mean you have these continuous inputs, which form a matrix of size (m x n_continuous),
and these are probably fed to a linear layer and so forth (in batches)…
Then you have the categorical features, which you first multiply by their embedding matrices. So for example, if one of the features has a cardinality of 8, then you would have an 8x4 matrix that represents this feature. But how and where would this fit in the picture? And would this matrix itself be the linear layer, or do we need to multiply it with the linear layer of the continuous feature matrix?

I have trouble visualizing what the network architecture looks like.
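For what it's worth, one common arrangement (not the only one, and not something confirmed in this thread) is to look up the embedding for each categorical feature, concatenate the result with the continuous inputs along the feature axis, and feed the combined matrix to the first linear layer. A minimal NumPy sketch, where every size and name is a hypothetical chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
batch_size, n_continuous = 5, 3   # a batch of the (m x n_continuous) inputs
cardinality, embedding_dim = 8, 4 # the 8x4 embedding matrix from the example
hidden_dim = 16

# Inputs: continuous features and one categorical feature stored as indices.
x_cont = rng.normal(size=(batch_size, n_continuous))
x_cat = rng.integers(0, cardinality, size=batch_size)

# Parameters (randomly initialised here; learned during training in practice).
E = rng.normal(size=(cardinality, embedding_dim))
W = rng.normal(size=(n_continuous + embedding_dim, hidden_dim))
b = np.zeros(hidden_dim)

# 1. Embedding lookup: one row of E per sample.
embedded = E[x_cat]                                     # (batch_size, 4)

# 2. Concatenate with the continuous features along the feature axis.
features = np.concatenate([x_cont, embedded], axis=1)   # (batch_size, 3 + 4)

# 3. First linear layer plus ReLU on the combined features.
hidden = np.maximum(features @ W + b, 0.0)              # (batch_size, 16)
```

In this picture the embedding matrix acts as its own small bias-free linear layer applied to the one-hot input, and its output sits side by side with the continuous columns rather than being multiplied with them. With several categorical features, each would get its own matrix and all the lookups would be concatenated the same way.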