Embedding layer has one more row than expected

cgrinaldi · February 26, 2019, 5:51am

I am trying to use the tabular learner to predict continuous values. A number of my columns are categorical, so I’m expecting them to be converted to embedding layers.

One of my categorical variables is month. I was expecting to see Embedding(12, 7) in the model, but I’m actually seeing Embedding(13, 7). A number of other categorical variables also seem to have one more row than I’m expecting.

Does anyone know why there is one more row in the embedding layer?

Pak · February 26, 2019, 8:10am

As I remember, this can happen if your data has None values (missing month data for some rows). The framework treats None as a separate value, so you get an additional row in the embedding matrix.

cgrinaldi · February 27, 2019, 3:51am

Thanks for the response, @Pak! That is a good thought, but I went through and double-checked, and I’m seeing 12 unique values in my month column. Not sure where the 13 is coming from…

cgrinaldi · February 27, 2019, 4:14am

Turns out you were actually right, @Pak! But it wasn’t because of my data. Instead, it looks like the fastai library itself adds an extra class, #na#, to each categorical variable. Here’s the link if you are interested:

github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L73


    if isinstance(proc, TabularProc): proc(ds.inner_df, test=True)
    else:
        #cat and cont names may have been changed by transform (like Fill_NA)
        proc = proc(ds.cat_names, ds.cont_names)
        proc(ds.inner_df)
        ds.cat_names,ds.cont_names = proc.cat_names,proc.cont_names
        self.procs[i] = proc
self.cat_names,self.cont_names = ds.cat_names,ds.cont_names
if len(ds.cat_names) != 0:
    ds.codes = np.stack([c.cat.codes.values for n,c in ds.inner_df[ds.cat_names].items()], 1).astype(np.int64) + 1
    self.classes = ds.classes = OrderedDict({n:np.concatenate([['#na#'],c.cat.categories.values])
                              for n,c in ds.inner_df[ds.cat_names].items()})
    cat_cols = list(ds.inner_df[ds.cat_names].columns.values)
else: ds.codes,ds.classes,self.classes,cat_cols = None,None,None,[]
if len(ds.cont_names) != 0:
    ds.conts = np.stack([c.astype('float32').values for n,c in ds.inner_df[ds.cont_names].items()], 1)
    cont_cols = list(ds.inner_df[ds.cont_names].columns.values)
else: ds.conts,cont_cols = None,[]
ds.col_names = cat_cols + cont_cols
ds.preprocessed = True
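
To make the effect concrete, here is a rough pure-Python sketch of what the two key lines in the excerpt appear to do: every category code is shifted up by 1, and '#na#' is prepended to the list of classes. It is not fastai code; `encode_column` and the sample data are hypothetical names made up for illustration, and it assumes the pandas behaviour where a missing value in a categorical column gets the code -1.

```python
# Pure-Python imitation of the encoding step in the excerpt above.
# encode_column and the sample data are hypothetical, not part of fastai.

def encode_column(values):
    """Return (classes, codes) built the way the excerpt builds them."""
    # Sorted unique non-missing values play the role of `c.cat.categories`.
    categories = sorted({v for v in values if v is not None})
    index = {cat: i for i, cat in enumerate(categories)}
    # Assumption: missing entries get code -1 (as pandas does); then add 1 to every code.
    codes = [index.get(v, -1) + 1 for v in values]
    # '#na#' is always prepended, whether or not the column has missing values.
    classes = ['#na#'] + categories
    return classes, codes

# 12 distinct months and no missing values still yield 13 classes.
months = list(range(1, 13))
classes, codes = encode_column(months)
print(len(classes))            # number of rows the embedding would need
print(min(codes), max(codes))  # code 0 stays reserved for '#na#'

# With a missing value present, that entry maps to code 0.
classes_na, codes_na = encode_column([3, None, 7])
print(classes_na, codes_na)
```

So a month column with 12 clean values ends up with 13 classes, which would explain the extra embedding row: row 0 is kept for '#na#' even when nothing in the training data is missing.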

Thanks again for your help!