Extracting Embedding Matrix from Collaborative Filtering Model


(Zarif) #1

Hi guys,

I have fitted my Collaborative Filtering model to my data and it fits well enough.

Now my main goal is to obtain the embedding matrix that describes each user. Has anyone extracted the embedding matrix from a fastai model?

It’s in the ‘collab.py’ file, under ‘class EmbeddingDotBias(nn.Module)’, in the forward() function. I need a way to obtain self.u_weight(users), which is the embedding matrix for users, from my Jupyter Notebook. Does anyone know how to do this?

Let me know and we can share ideas :slight_smile:


(Zarif) #2

Since posting my question I’ve learnt that what I’m looking for is below:

learn.model.u_weight
output: Embedding(64382, 40)

Now I just have to figure out how to extract an actual matrix of numbers from that embedding object. Does anyone have any ideas?


(antoine mercier) #3

I think you could do it this way:

learn.model.u_weight.weight

This would give a tensor of weights. If you want to convert it to a numpy array you can wrap it with to_np():

to_np(learn.model.u_weight.weight)

or equivalently:

learn.model.u_weight.weight.data.cpu().numpy()
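
For anyone who wants to try this without a trained learner, here is a minimal sketch in plain PyTorch. The nn.Embedding(5, 3) layer is only a hypothetical stand-in for learn.model.u_weight (5 made-up users, 3 factors), and I use detach() rather than .data, which is my own choice here and not something taken from the fastai source:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for learn.model.u_weight: 5 users, 3 latent factors each.
u_weight = nn.Embedding(5, 3)

# .weight is a Parameter of shape (n_users, n_factors). Detach it from the
# autograd graph and move it to the CPU before converting it to a NumPy array.
user_matrix = u_weight.weight.detach().cpu().numpy()

# Row i of the matrix is the vector the layer returns for index i.
user_2 = u_weight(torch.tensor([2])).detach().cpu().numpy()[0]
```

So once you have the matrix, the embedding for a given user is just the row at that user's integer index.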


(Zarif) #4

Thank you, Antoine.


(Ying Xie) #5

I’m also trying to extract embeddings trained with a neural network. In fastai, the convention is to give the embedding a size of (categories + 1, output dimension), where the +1 is for unknown values. This means the PyTorch embedding tensor has one more row than I have categories. My question is: which index does the unknown category map to?

For instance, suppose I have a categorical variable with levels a < b < c, so the embedding tensor has size (4, 2). Would category ‘a’ map to embedding tensor index 0 or 1?

Thanks in advance for any help.


(antoine mercier) #6

Suppose you have a categorical variable named var1 in a list cat_vars:

cat_vars = ['var1','var2','var3']

data = TabularDataBunch.from_df(path, df, dep_var,
                                  valid_idx=valid_idx,
                                  procs=procs,
                                  cat_names=cat_vars,
                                  cont_names=cont_vars,
                                  bs=128)

learn = tabular_learner(data, layers=[1000,500],
                          emb_szs=emb_szs,
                          ps=[0.05,0.05], emb_drop=0.05, metrics=rmse)

Then you can access the categories encoded in the corresponding embedding matrix as follows:

learn.data.classes['var1']

This command returns:

array(['#na#', '135-04003-401', '135-04005-001', '145-20231-001', ..., 'V57466215 000 00', 'V57466216 200 00',
       'V57466248 200 00', 'V57466517 001 00'], dtype=object)

As you can see in the above example the category for the ‘unknown’ class is at index 0 ('#na#').

Note that in my experience this works only for regression tasks. In a classification task this command returns the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-123-ea9f310c586e> in <module>
----> 1 learn.data.classes['PN']

TypeError: list indices must be integers or slices, not str

Therefore there seems to be an inconsistency in the source code worth checking by the developers.
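
To make the index mapping concrete for the a < b < c question, here is a small sketch in plain Python. It assumes the classes list has the same layout as the output above ('#na#' first, then the categories in order). The numbers in emb are made up, and cat_to_idx and embedding_for are hypothetical helpers of mine, not fastai functions:

```python
# Hypothetical category list laid out like learn.data.classes['var1']:
# the placeholder for unknown values comes first, then the categories.
classes = ['#na#', 'a', 'b', 'c']

# Made-up stand-in for the trained embedding weights, shape (4, 2):
# one row per entry in classes.
emb = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Map each category to its row index in the embedding matrix.
cat_to_idx = {c: i for i, c in enumerate(classes)}

def embedding_for(category):
    """Return the embedding row for a category, falling back to the '#na#' row."""
    return emb[cat_to_idx.get(category, cat_to_idx['#na#'])]
```

Under that assumption, category ‘a’ sits at index 1 and index 0 is reserved for unknown values.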


(Ying Xie) #7

Thanks for the suggestion. It didn’t work for me. I’m using fastai v0.7 and learn.data doesn’t have “classes”. I’ll poke around some more to see if I can find the equivalent.