FillMissing but add the na as continuous columns

mraggi · June 6, 2021, 3:00am

Hi there.

I don’t know why FillMissing adds the column_na columns as categorical variables, but I’m trying to add them as continuous variables instead (just 0 and 1; I don’t need an embedding of size 3 for them).

So I changed FillMissing.encodes to:

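def encodes(self, to):
    # Boolean mask of missing values over the continuous columns
    missing = pd.isnull(to.conts)
    for n in missing.any()[missing.any()].keys():
        assert n in self.na_dict, f"nan values in `{n}` but not in setup training set"
    for n in self.na_dict.keys():
        # Fill with the value stored for this column during setup
        to[n].fillna(self.na_dict[n], inplace=True)
        if self.add_col:
            to.loc[:,n+'_na'] = missing[n]
            # Changed line: register the indicator as continuous, not categorical
            if n+'_na' not in to.cont_names: to.cont_names.append(n+'_na')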

(so I just changed cat_names to cont_names in the last line).

This works nicely when training, but when predicting, as in

learn.dls.test_dl(test_df)

it complains about

KeyError: 'column_na' not in index

So I’m guessing that some part of the fastai code adds the column_na column when it is missing from the test dataframe, but I can’t find it.
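
To make the symptom concrete, here is a minimal pandas-only sketch of what I suspect is going on. It is a guess, not confirmed fastai behaviour: I am assuming the error comes from selecting columns by a cont_names list that already contains the indicator name, on a test frame that does not have that column yet. The function fill_missing_as_cont and the variables cont_names and na_dict below are hypothetical stand-ins, not fastai API.

```python
import pandas as pd

# Hypothetical stand-ins for the fastai objects in the post (not fastai API).
cont_names = ["age"]      # continuous columns before the fill step runs
na_dict = {"age": 30.0}   # fill values learned on the training set


def fill_missing_as_cont(df, cont_names, na_dict):
    """Fill NaNs and add 0/1 `<col>_na` indicator columns as continuous."""
    df = df.copy()
    # Build the mask only from the columns named in na_dict, so indicator
    # names already appended to cont_names are never looked up in a frame
    # that does not contain them yet.
    missing = df[list(na_dict)].isnull()
    for n, fill in na_dict.items():
        df[n] = df[n].fillna(fill)
        df[n + "_na"] = missing[n].astype("float32")  # 0.0 / 1.0, not bool
        if n + "_na" not in cont_names:
            cont_names.append(n + "_na")
    return df


train = fill_missing_as_cont(pd.DataFrame({"age": [20.0, None]}), cont_names, na_dict)

# cont_names is now ["age", "age_na"]. A raw test frame has no "age_na" yet,
# so selecting by the full list raises KeyError in plain pandas.
test_df = pd.DataFrame({"age": [None, 40.0]})
try:
    test_df[cont_names]
    raised = False
except KeyError:
    raised = True

test = fill_missing_as_cont(test_df, cont_names, na_dict)
```

If this is the mechanism, the mask has to be computed from the original columns only (here the keys of na_dict) rather than from everything listed in cont_names, since that list has grown by the time the test data is processed.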