Suppose you have a categorical variable named var1 in a list cat_vars:
from fastai.tabular import *  # fastai v1

# path, df, dep_var, valid_idx, procs, cont_vars and emb_szs are assumed to be defined already
cat_vars = ['var1','var2','var3']
data = TabularDataBunch.from_df(path, df, dep_var,
                                valid_idx=valid_idx,
                                procs=procs,
                                cat_names=cat_vars,
                                cont_names=cont_vars,
                                bs=128)
learn = tabular_learner(data, layers=[1000,500],
                        emb_szs=emb_szs,
                        ps=[0.05,0.05], emb_drop=0.05, metrics=rmse)
Then you can access the categories encoded in the corresponding embedding matrix as follows:
learn.data.classes['var1']
This command returns:
array(['#na#', '135-04003-401', '135-04005-001', '145-20231-001', ..., 'V57466215 000 00', 'V57466216 200 00',
'V57466248 200 00', 'V57466517 001 00'], dtype=object)
As you can see in the above example, the category for the 'unknown' class ('#na#') is at index 0.
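To make the index mapping concrete, here is a minimal sketch. It does not call fastai: the classes array is a shortened stand-in for the output above, embedding_row is a hypothetical helper, and it assumes that row i of the embedding matrix corresponds to classes[i].

```python
import numpy as np

# Stand-in for learn.data.classes['var1'] (shortened copy of the output above)
classes = np.array(['#na#', '135-04003-401', '135-04005-001', '145-20231-001'],
                   dtype=object)

# Map each category to its position in the array
cat_to_idx = {cat: idx for idx, cat in enumerate(classes)}

def embedding_row(value):
    """Hypothetical helper: position of `value`, falling back to 0 ('#na#')."""
    return cat_to_idx.get(value, 0)

print(embedding_row('135-04005-001'))
print(embedding_row('not-seen-in-training'))
```

Falling back to index 0 for unseen values mirrors the position of '#na#' in the array; whether the library does the same at inference time is worth verifying for your version.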
Note that, in my experience, this works only for regression tasks. In a classification task, the same lookup (here for a categorical variable named 'PN') raises the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-123-ea9f310c586e> in <module>
----> 1 learn.data.classes['PN']
TypeError: list indices must be integers or slices, not str
The message shows that learn.data.classes is a list rather than a dictionary in that case, so there seems to be an inconsistency in the source code that the developers may want to check.
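In the meantime, a small guard can avoid the TypeError. The sketch below is not part of fastai: get_categories is a hypothetical helper, the two stand-in objects use made-up values, and it assumes the regression case gives a dict-like mapping keyed by variable name.

```python
from collections.abc import Mapping

def get_categories(classes, name):
    """Return the categories for `name`, or None if `classes` is not keyed by it."""
    if isinstance(classes, Mapping) and name in classes:
        return list(classes[name])
    return None

# Stand-ins for learn.data.classes (made-up values, not real fastai output)
regression_like = {'var1': ['#na#', 'a', 'b']}   # indexable by variable name
classification_like = ['label0', 'label1']       # a plain list, as in the traceback

print(get_categories(regression_like, 'var1'))
print(get_categories(classification_like, 'var1'))
```

In practice you would pass learn.data.classes as the first argument and treat a None result as a sign that the categories need to be found elsewhere.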