Get Embedding Vectors from StructuredModel

martin.m · February 27, 2018, 12:13am

Hi @ all,

is there a way to extract the embedding layers/vectors from the final trained model when using StructuredLearner from column_data?

I have trained a net as in the lesson 4 - Rossmann example, but on a different dataset (used cars from Kaggle). Now I want to visualize the embeddings (after dimensionality reduction), in order to show for example which brands are close to each other etc.

The embeddings must be stored in the model object in some way, I assume, as is visible from learner_object.get_layer_groups(), but I haven’t found a way to get them.

Any help highly appreciated!

kcturgutlu · February 27, 2018, 12:33am

Try net.parameters()

martin.m · February 27, 2018, 5:55pm

Unfortunately, it tells me StructuredLearner does not have an attribute parameters. Or am I doing something wrong here?

My code is the following:

cars = pd.read_csv("autos.csv", encoding = "latin-1")

cat_vars = ["seller", "offerType",
              "abtest", "vehicleType","gearbox", "model", "fuelType",
              "brand", "notRepairedDamage"]
excl_vars = ["name", "dateCrawled", "dateCreated", "lastSeen"]

for cat in cat_vars:
    cars[cat] = cars[cat].astype("category")

x, y, nas, mapper = proc_df(cars[:50000], "price", skip_flds=excl_vars, do_scale=True)

cat_sz = [(c, len(cars[c].cat.categories)+1) for c in cat_vars]

emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]

val_idx = np.random.randint(0, len(x), int(len(x)*0.15)); val_idx

model_data = column_data.ColumnarModelData.from_data_frame("/home/nbuser/", 
                                                           val_idxs=val_idx, df=x, y=y.astype(np.float32), 
                                                           cat_flds=cat_vars, bs = 64)

model = model_data.get_learner(emb_szs, len(x.columns)-len(cat_vars),
                   0.04, 1, [1000, 500], [0.001, 0.01])
lr = 1e-3

model.lr_find()

model.fit(lr, 3)

## this is not working...
model.parameters()

wgpubs · February 27, 2018, 8:35pm

They are accessible in MixedInputModel.embs

You should be able to get at them via learner.model.embs (or something like this).

martin.m · February 28, 2018, 6:28pm

Thank you very much! It was a mix of both suggestions. In my case, this works:

model.model.embs.parameters()

Do you by any chance also know how to match each embedding vector to the corresponding category? The names do not seem to be included in the output of parameters()…

Tchotchke · March 16, 2018, 7:24pm

It will be in the same order as your cat_vars variable. If you want to check for yourself, first take a look at emb_szs, then do cars[cat_vars[0]].value_counts() and you should see that the number of unique values in that variable equals the first number in the first emb_szs entry, minus one.
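To make the "minus one" concrete, here is a minimal sketch in plain Python. The toy columns are made up and only stand in for cars[cat_vars]; the sizing rules are the ones from the code earlier in this thread, and n_categories is a hypothetical helper, not part of any library.

```python
# Hypothetical toy data standing in for cars[cat_vars]; not the real dataset.
columns = {
    "gearbox": ["manual", "automatic", "manual", None],
    "brand": ["audi", "bmw", "audi", "volkswagen"],
}
cat_vars = ["gearbox", "brand"]

def n_categories(values):
    """Count distinct non-missing values, like len(series.cat.categories)."""
    return len({v for v in values if v is not None})

# Rule from the thread: number of distinct categories plus one extra row.
cat_sz = [(c, n_categories(columns[c]) + 1) for c in cat_vars]
# Rule from the thread: width is min(50, (rows + 1) // 2).
emb_szs = [(rows, min(50, (rows + 1) // 2)) for _, rows in cat_sz]

for (name, _), (rows, width) in zip(cat_sz, emb_szs):
    print(name, "->", rows, "rows x", width, "columns")
```

So a variable with 3 distinct values gets an embedding matrix with 4 rows, and the entries of emb_szs line up with cat_vars position by position.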

MarKo · April 4, 2018, 2:08am

Just in case anyone else is struggling with getting the embeddings, below is what I did.

embs = list()
for param in learn.model.embs.parameters():
    embs.append(param)

Then you can access the matrices from the list in the order of the cat variables:
embs[0], embs[1] etc.

If there is a more efficient way, any suggestion would be appreciated.
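One possible variation, as a sketch only: keep the matrices in a dict keyed by variable name instead of a positional list. This assumes learn.model.embs is a PyTorch nn.ModuleList with one nn.Embedding per categorical variable, in cat_vars order. The ModuleList below is a stand-in built by hand so the snippet runs on its own, and the variable names and sizes are made up.

```python
from torch import nn

# Stand-in for learn.model.embs. Assumption: it is an nn.ModuleList holding one
# nn.Embedding per categorical variable, in cat_vars order.
cat_vars = ["gearbox", "brand"]   # hypothetical subset of the real list
emb_szs = [(3, 2), (41, 21)]      # hypothetical (rows, width) pairs
embs = nn.ModuleList([nn.Embedding(rows, width) for rows, width in emb_szs])

# One weight matrix per variable, detached from autograd and moved to the CPU.
weights = {name: emb.weight.detach().cpu() for name, emb in zip(cat_vars, embs)}

for name, w in weights.items():
    print(name, tuple(w.shape))
```

With a trained learner you would replace the stand-in with learn.model.embs and then look up weights["brand"] instead of remembering that brand is at position 7.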

tetelias · June 8, 2018, 10:28pm

Haven’t found an answer anywhere on the forum, so:
after getting embeddings from the model, how do I map them to values in categorical columns?
Assuming df is the dataframe I got from applying proc_df to the train set (and df_test the one made from the test set), do I just replace the values there, which are all numerical now, say 7 or 10, with embedding_matrix_of_the_column[7] or embedding_matrix_of_the_column[10] respectively?

shady · June 28, 2018, 7:12pm

I have a similar question regarding the embeddings. For each categorical variable, are the rows of the embeddings in the order 1, 2, 3, … to the last number?

sjdlloyd · June 28, 2018, 8:36pm

The embedding index is the same as the categorical index you pass in from the DataFrame. This is equivalent to what happens in proc_df.
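A minimal sketch of that correspondence in plain Python. The matrix and the codes are invented numbers, and treating row 0 as the slot for missing or unknown values is an assumption here, so check it against your own processed data.

```python
# Hypothetical 2-dimensional embedding matrix for one column. Row 0 is assumed
# to be the slot for missing/unknown values, rows 1..3 the known categories.
embedding_matrix = [
    [0.0, 0.0],
    [0.3, 0.5],
    [0.2, 0.1],
    [0.6, 0.9],
]

# Integer codes as they might appear in the processed column (made-up values).
codes = [2, 1, 3, 0, 2]

# Replace each code by the embedding row with the same index.
vectors = [embedding_matrix[code] for code in codes]

for code, vec in zip(codes, vectors):
    print(code, vec)
```

The same lookup applies to a test set only if its columns were encoded with the same category-to-integer mapping as the training set.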

shady · June 29, 2018, 6:59pm

Thank you for the reply. Just to be clear, if in your example I create an embedding matrix of dimension 2, and the tensor for ‘thing’ is
[ 0.2 0.4
0.1 0.1
0.7 0.2],
then the first row corresponds to Cake, the second to Fish, and the third to Mellon?

Nadav · August 28, 2018, 10:59pm

embedding_names = ['UNKNOWN'] + list(joined[col_name].cat.categories)

joined: your data frame
col_name: the name of the embedded column
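As a sketch of how that list could be used: pair each name with its row to get a name-to-vector lookup. The categories and weights below are invented; in practice the first would come from list(joined[col_name].cat.categories) and the second from the trained embedding for that column.

```python
# Hypothetical labels, standing in for list(joined[col_name].cat.categories).
categories = ["audi", "bmw", "volkswagen"]
embedding_names = ["UNKNOWN"] + categories

# Hypothetical weight matrix with one row per name (row 0 for UNKNOWN).
embedding_matrix = [
    [0.0, 0.0],
    [0.3, 0.5],
    [0.2, 0.1],
    [0.6, 0.9],
]
# Sanity check: the name list and the matrix must have the same length.
assert len(embedding_names) == len(embedding_matrix)

name_to_vector = dict(zip(embedding_names, embedding_matrix))
print(name_to_vector["bmw"])
```

If the length check fails, the names and the matrix probably belong to different columns, or the extra UNKNOWN row does not apply to your setup.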