Hi all,
is there a way to extract the embedding layers/vectors from the final trained model when using StructuredLearner from column_data?
I have trained a net as in the lesson 4 - Rossmann example, but on a different dataset (used cars from Kaggle). Now I want to visualize the embeddings (after dimensionality reduction), in order to show for example which brands are close to each other etc.
The embeddings must be stored in the model object somehow, I assume, since they show up in learner_object.get_layer_groups(), but I haven't found a way to access them.
Any help highly appreciated!
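For the visualization step, a minimal dimensionality-reduction sketch with plain NumPy (PCA via SVD); the 5x4 matrix here is a made-up stand-in for one extracted embedding matrix:

```python
import numpy as np

def pca_2d(emb):
    """Project an (n_categories, emb_dim) embedding matrix onto its top two principal components."""
    centered = emb - emb.mean(axis=0)
    # rows of Vt are the principal axes of the centered matrix
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T

# hypothetical embedding: 5 categories, 4 dimensions
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 4))
coords = pca_2d(emb)
print(coords.shape)  # (5, 2)
```

Each row of coords can then be scatter-plotted and labelled with its category name.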
Unfortunately, it tells me StructuredLearner does not have an attribute parameters. Or am I doing something wrong here?
My code is the following:
cars = pd.read_csv("autos.csv", encoding = "latin-1")
cat_vars = ["seller", "offerType",
"abtest", "vehicleType","gearbox", "model", "fuelType",
excl_vars = ["name", "dateCrawled", "dateCreated", "lastSeen"]
for cat in cat_vars:
    cars[cat] = cars[cat].astype("category")
x, y, nas, mapper = proc_df(cars[:50000], "price", skip_flds=excl_vars, do_scale=True)
cat_sz = [(c, len(cars[c].cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]
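The rule above picks roughly half the cardinality as embedding size, capped at 50. With made-up cardinalities (variable names and counts are illustrative only), it works out like this:

```python
# Hypothetical cardinalities: number of categories in each variable, plus one
# for the unknown/NA class (as in cat_sz above).
cat_sz = [("brand", 41), ("gearbox", 3), ("model", 251)]
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]
print(emb_szs)  # [(41, 21), (3, 2), (251, 50)]
```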
val_idx = np.random.choice(len(x), int(len(x) * 0.15), replace=False); val_idx
model_data = column_data.ColumnarModelData.from_data_frame("/home/nbuser/",
val_idxs=val_idx, df=x, y=y.astype(np.float32),
cat_flds=cat_vars, bs = 64)
model = model_data.get_learner(emb_szs, len(x.columns)-len(cat_vars),
0.04, 1, [1000, 500], [0.001, 0.01])
lr = 1e-3
## this is not working...
They are accessible in
You should be able to get at them via learner.model.embs (or something like this).
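The idea can be sketched with plain PyTorch; the nn.ModuleList below is a made-up stand-in for the trained model's embedding layers (fastai's structured model keeps one nn.Embedding per categorical variable), with sizes I invented:

```python
import torch.nn as nn

# Stand-in for learner.model.embs: one nn.Embedding per categorical variable.
embs = nn.ModuleList([nn.Embedding(41, 21), nn.Embedding(3, 2), nn.Embedding(251, 50)])

# Each layer's full embedding matrix is its weight tensor; detach to numpy for analysis.
matrices = [e.weight.data.cpu().numpy() for e in embs]
print([m.shape for m in matrices])  # [(41, 21), (3, 2), (251, 50)]
```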
Thank you very much! It was a mix between both suggestions. In my case
Do you by any chance also know how to match each embedding vector to the corresponding category? The names do not seem to be included in the output of
It will be in the same order as your cat_vars variable. If you want to check for yourself, first take a look at emb_szs, then do cars["gearbox"].value_counts() (or any other variable from cat_vars); you should see that the number of unique values in that variable equals the first number of the corresponding emb_szs entry (minus one).
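Since the order matches cat_vars, the names and matrices can simply be zipped together; the values below are placeholders:

```python
# Hypothetical: variable names in the order the model was built, and the
# matrices extracted from the model in that same order.
cat_vars = ["seller", "offerType", "gearbox"]
matrices = [[[0.1, 0.2]], [[0.3, 0.4]], [[0.5, 0.6]]]  # placeholder weights
emb_by_name = dict(zip(cat_vars, matrices))
print(emb_by_name["gearbox"])  # [[0.5, 0.6]]
```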
Just in case anyone else is struggling with getting the embeddings, below is what I did:

embs = list()
for param in learn.model.embs.parameters():
    embs.append(param.data.cpu().numpy())

Then you can call the matrices from the list in the order of the cat variables: embs[0], embs[1], etc.
If there is a more efficient way any suggestion would be appreciated.
Haven’t found an answer anywhere on the forum, so:
after getting embeddings from the model, how do I map them to values in categorical columns?
Assuming df is the dataframe I got from applying proc_df to the train set (and df_test made from the test set), do I just replace the values there, which are all numerical now, say 7 or 10, with embedding_matrix_of_the_column[7] or embedding_matrix_of_the_column[10] respectively?
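A minimal sketch of that replacement, assuming proc_df leaves 1-based integer codes (with 0 reserved for unknown/NA), so that code k picks row k of the matrix; all values below are placeholders:

```python
import numpy as np

# Placeholder 4x2 embedding matrix; row 0 is the unknown/NA class,
# rows 1..3 correspond to category codes 1..3.
emb = np.array([[0.0, 0.0],
                [0.2, 0.4],
                [0.6, 0.1],
                [0.9, 0.3]])
codes = np.array([1, 3, 2, 1])  # hypothetical encoded column from proc_df
vectors = emb[codes]            # one embedding row per original value
print(vectors.shape)  # (4, 2)
```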
I have a similar question regarding the embeddings. For each categorical variable, are the rows of the embeddings in the order 1, 2, 3, … to the last number?
The embedding index is the same as the categorical index you pass in from the DataFrame. This is equivalent to what happens with proc_df.
Thank you for the reply. Just to be clear: if in your example I create an embedding matrix of dimension 2, and the tensor for 'thing' starts with

[0.2, 0.4]
...

then the first row corresponds to Cake, the second to Fish, and the third to Mellon?
embedding_names = ['UNKNOWN'] + list(joined[col_name].cat.categories)
where joined is your data frame and col_name is the column that was embedded.
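Putting that together, a small sketch with a hypothetical fuelType column and a placeholder 4x2 matrix (note that pandas stores the categories in sorted order):

```python
import pandas as pd

# Hypothetical column with three categories; one extra leading row is
# reserved for unknown values, hence the 4x2 placeholder matrix.
joined = pd.DataFrame({"fuelType": pd.Categorical(["petrol", "diesel", "lpg", "petrol"])})
col_name = "fuelType"
emb = [[0.0, 0.0], [0.5, 0.1], [0.2, 0.7], [0.9, 0.4]]

embedding_names = ['UNKNOWN'] + list(joined[col_name].cat.categories)
named = pd.DataFrame(emb, index=embedding_names, columns=["d0", "d1"])
print(named.index.tolist())  # ['UNKNOWN', 'diesel', 'lpg', 'petrol']
```

The resulting frame has one labelled row per category, ready for plotting.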