Get Embedding Vectors from StructuredModel

Hi all,

Is there a way to extract the embedding layers/vectors from the final trained model when using StructuredLearner from column_data?

I have trained a net as in the lesson 4 Rossmann example, but on a different dataset (used cars from Kaggle). Now I want to visualize the embeddings (after dimensionality reduction) in order to show, for example, which brands are close to each other.

The embeddings must be stored somewhere in the model object, I assume, as is suggested by learner_object.get_layer_groups(), but I haven’t found a way to get at them.

Any help highly appreciated!


Try net.parameters()

Unfortunately, it tells me that StructuredLearner has no attribute parameters. Or am I doing something wrong here?

My code is the following:

import numpy as np
import pandas as pd
from fastai.structured import proc_df     # fastai v0.7
from fastai import column_data            # fastai v0.7

cars = pd.read_csv("autos.csv", encoding="latin-1")

cat_vars = ["seller", "offerType", "abtest", "vehicleType", "gearbox",
            "model", "fuelType", "brand", "notRepairedDamage"]
excl_vars = ["name", "dateCrawled", "dateCreated", "lastSeen"]

for cat in cat_vars:
    cars[cat] = cars[cat].astype("category")

x, y, nas, mapper = proc_df(cars[:50000], "price", skip_flds=excl_vars, do_scale=True)

# cardinality of each categorical variable (+1 for the unknown/NaN category)
cat_sz = [(c, len(cars[c].cat.categories) + 1) for c in cat_vars]

# rule of thumb from the lesson: roughly half the cardinality, capped at 50
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]

# note: randint samples with replacement, so val_idx can contain duplicates
val_idx = np.random.randint(0, len(x), int(len(x) * 0.15))

model_data = column_data.ColumnarModelData.from_data_frame(
    "/home/nbuser/", val_idxs=val_idx, df=x, y=y.astype(np.float32),
    cat_flds=cat_vars, bs=64)

model = model_data.get_learner(emb_szs, len(x.columns) - len(cat_vars),
                               0.04, 1, [1000, 500], [0.001, 0.01])
lr = 1e-3

model.lr_find()

model.fit(lr, 3)

## this fails: `model` here is a Learner, not an nn.Module,
## so it has no parameters() method
model.parameters()

They are accessible in MixedInputModel.embs

You should be able to get at them via learner.model.embs (or something like this).

Thank you very much! It was a combination of both suggestions. In my case:

model.model.embs.parameters()

Do you by any chance also know how to match each embedding vector to its corresponding category? The names do not seem to be included in the output of parameters().


It will be in the same order as your cat_vars variable. If you want to check for yourself, first take a look at emb_szs, then run cars[cat_vars[0]].value_counts() and you should see that the number of unique values in that variable equals the first number of the first emb_szs entry minus one.
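For example, a quick sanity check along those lines (reusing cars, cat_vars, and emb_szs from the code above):

# the i-th embedding belongs to cat_vars[i];
# each emb_szs entry is (cardinality + 1, embedding dim),
# where the +1 covers the unknown/NaN category
for name, (n_cat, emb_dim) in zip(cat_vars, emb_szs):
    print(name, n_cat - 1, cars[name].nunique())  # these two numbers should match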

1 Like

Just in case anyone else is struggling with getting the embeddings, below is what I did:

embs = list()
for param in learn.model.embs.parameters():
    embs.append(param)

Then you can access the matrices from the list in the order of the cat variables: embs[0], embs[1], etc.

If there is a more efficient way any suggestion would be appreciated.
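Since learn.model.embs seems to be an nn.ModuleList of nn.Embedding layers, the loop can probably be collapsed; a minimal sketch:

# same result as the loop above
embs = list(learn.model.embs.parameters())

# or, to get plain tensors instead of Parameter objects
emb_mats = [emb.weight.data for emb in learn.model.embs]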


Haven’t found an answer anywhere on the forum, so: after getting the embeddings from the model, how do I map them to the values in the categorical columns?
Assuming df is the dataframe I got from applying proc_df to the train set (and df_test made from the test set), do I just replace the values there, which are all numerical now, say 7 or 10, with embedding_matrix_of_the_column[7] or embedding_matrix_of_the_column[10] respectively?

I have a similar question regarding the embeddings. For each categorical variable, are the rows of the embeddings in the order 1, 2, 3, … to the last number?

The embedding index is the same as the categorical index you pass in from the DataFrame. This is equivalent to what happens with proc_df.
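A minimal sketch of that lookup, assuming proc_df stored cat.codes + 1 (so row 0 of each embedding matrix is the NaN/unknown category) and reusing the names from the code above:

import torch

# embedding layer for "brand", matched by its position in cat_vars
brand_emb = learn.model.embs[cat_vars.index("brand")].weight.data

codes = torch.LongTensor(x["brand"].values)  # already numericalized by proc_df
brand_vectors = brand_emb[codes]             # row i is the embedding of sample i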


Thank you for the reply. Just to be clear: if in your example I create an embedding matrix of dimension 2, and the tensor for ‘thing’ is

[[0.2, 0.4],
 [0.1, 0.1],
 [0.7, 0.2]]

then the first row corresponds to Cake, the second to Fish, and the third to Mellon?

embedding_names = ['UNKNOWN'] + list(joined[col_name].cat.categories)

joined is your data frame and col_name is the column that was embedded. Row 0 of the embedding matrix belongs to the 'UNKNOWN' (NaN) category, and the remaining rows follow the order of cat.categories.
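So, for example, to get a labeled matrix for one column (a sketch using the cars dataframe and cat_vars from earlier in the thread, and assuming the model's embeddings follow the cat_vars order):

import pandas as pd

col_name = "brand"  # for example
emb_mat = learn.model.embs[cat_vars.index(col_name)].weight.data.cpu().numpy()
names = ['UNKNOWN'] + list(cars[col_name].cat.categories)
emb_df = pd.DataFrame(emb_mat, index=names)  # one labeled row per category

From there, emb_df can go straight into PCA or t-SNE for the kind of visualization the original question asked about.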
