I have encoded my dataset using fastai's tabular module and trained a model:
```python
to_nn = TabularPandas(data, procs, cat, cont,
                      splits=splits, y_names='identified')
dls = to_nn.dataloaders(1024, device=device)
learn = tabular_learner(dls, layers=[500, 250], n_out=1)
learn.fit_one_cycle(12, 3e-3)
```
and then looped over the categorical columns to pull out their embedding activations, naming the new columns in OriginalColumn_n format:
```python
for i, col in enumerate(learn.dls.cat_names[:5]):
    emb = learn.model.embeds[i]
    # look up the embedding vector for each row's encoded category value
    emb_data = emb(tensor(to_nn.train.xs[col], dtype=torch.int64))
    # name each embedding dimension OriginalColumn_j
    emb_names = [f'{col}_{j}' for j in range(emb_data.shape[1])]
    display(emb_names)
```
I am curious how (and whether) I could best build a dictionary to replace "j", so that when I output feature importances I see FRUIT_APPLE instead of FRUIT_0.
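What I am picturing is roughly this pure-pandas sketch (toy column name, hypothetical; I know fastai's Categorify may assign codes differently, e.g. reserving an index for #na#, so this is just to show the kind of mapping I mean):

```python
import pandas as pd

# hypothetical raw column before encoding
fruit = pd.Series(['APPLE', 'BANANA', 'APPLE', 'CHERRY'], name='FRUIT')

# code -> original value, as pandas assigns category codes
vocab = dict(enumerate(fruit.astype('category').cat.categories))
print(vocab)  # {0: 'APPLE', 1: 'BANANA', 2: 'CHERRY'}
```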
Of course, this raises the question: after embedding, does each 'column' of emb_data still correspond to an individual starting value (APPLE), or is each column just one dimension of an n-dimensional encoding shared by all of FRUITS?
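To make the question concrete, here is a minimal plain-PyTorch sketch (toy names, independent of my fastai code) of how I understand the embedding lookup to work, with the shapes printed:

```python
import torch
import torch.nn as nn

# toy vocabulary for a single categorical column, FRUIT
fruits = ['APPLE', 'BANANA', 'CHERRY']
emb = nn.Embedding(num_embeddings=len(fruits), embedding_dim=4)

# encoded column: each dataset row holds one category code
codes = torch.tensor([0, 2, 1, 0])  # APPLE, CHERRY, BANANA, APPLE
emb_data = emb(codes)

# rows of emb.weight correspond to category values (one per fruit) ...
print(emb.weight.shape)  # torch.Size([3, 4])
# ... while columns of emb_data look like latent dimensions shared by all fruits
print(emb_data.shape)    # torch.Size([4, 4])
```

If that reading is right, FRUIT_0 would name a latent dimension rather than APPLE, which is exactly what I would like confirmed.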