Data.x.cat_names for tabular DataBunch includes columns other than those specified

wgpubs · December 2, 2018, 6:39am

Is this supposed to work as such?

cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',
    'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
    'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
    'SchoolHoliday_fw', 'SchoolHoliday_bw']

but after running this …

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                   .split_by_idx(valid_idx)
                   .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
                   )

data.x.cat_names

I see the following:

['Store',
 'DayOfWeek',
 'Year',
 'Month',
 'Day',
 'StateHoliday',
 'CompetitionMonthsOpen',
 'Promo2Weeks',
 'StoreType',
 'Assortment',
 'PromoInterval',
 'CompetitionOpenSinceYear',
 'Promo2SinceYear',
 'State',
 'Week',
 'Events',
 'Promo_fw',
 'Promo_bw',
 'StateHoliday_fw',
 'StateHoliday_bw',
 'SchoolHoliday_fw',
 'SchoolHoliday_bw',
 'CompetitionDistance_na',
 'CloudCover_na']

The last two items aren’t in the cat_vars list so this was unexpected.

wgpubs · December 2, 2018, 7:05am

Even stranger …

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                   .split_by_idx(valid_idx)
                   .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
                   .databunch())

… actually changes the local cat_vars variable so that it now includes 'CompetitionDistance_na', 'CloudCover_na'.

edwardjross · December 2, 2018, 7:17am

This is a side effect of the FillMissing proc: for continuous variables that have missing values, it adds a column marking which values were missing before replacing them with a fill value (because sometimes the fact that they are missing may be predictive in itself).

Check out the source code of FillMissing for the details:

                    df.loc[:,name+'_na'] = pd.isnull(df.loc[:,name])
                    if name+'_na' not in self.cat_names: self.cat_names.append(name+'_na')
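
To see why the local list changes, here is a minimal pure-Python sketch of the same pattern. FillMissingSketch, make_rows, and the sample rows are hypothetical stand-ins, not fastai's code; the only thing carried over from the excerpt above is the append to self.cat_names. Because the object stores the caller's list rather than a copy, the append is visible through the caller's variable too.

```python
class FillMissingSketch:
    """Hypothetical stand-in for a preprocessing step; not fastai's class."""

    def __init__(self, cat_names, cont_names):
        # No copy is made: self.cat_names is the caller's list object.
        self.cat_names = cat_names
        self.cont_names = cont_names

    def apply(self, rows):
        for name in self.cont_names:
            # Only columns that actually contain a missing value get a flag.
            if any(row[name] is None for row in rows):
                for row in rows:
                    row[name + "_na"] = row[name] is None
                if name + "_na" not in self.cat_names:
                    self.cat_names.append(name + "_na")


def make_rows():
    # Hypothetical two-row table with one missing continuous value.
    return [
        {"Store": 1, "CompetitionDistance": 570.0},
        {"Store": 2, "CompetitionDistance": None},
    ]


# Passing the list itself: the caller's variable grows.
cat_vars = ["Store"]
FillMissingSketch(cat_vars, ["CompetitionDistance"]).apply(make_rows())
print(cat_vars)  # ['Store', 'CompetitionDistance_na']

# Passing a copy: the caller's variable is left alone.
safe_cat_vars = ["Store"]
proc = FillMissingSketch(list(safe_cat_vars), ["CompetitionDistance"])
proc.apply(make_rows())
print(safe_cat_vars)  # ['Store']
```

If the library mutates the list it receives, as the behaviour reported in this thread suggests, passing a copy such as cat_names=list(cat_vars) should keep the local variable unchanged.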

wgpubs · December 2, 2018, 7:27am

Thanks for citing the code where this is happening.

It was just unexpected … local variables generally aren’t modified by the functions you pass them to as arguments.

Two questions:

I see FillMissing specifies an add_col argument. If set to True, I’m assuming it will not add the column?
Following up on your completely valid statement, “because sometimes the fact they are missing may be predictive in itself” … what if you discover that it is predictive to include some of the “_na” columns but exclude others? Can you pass multiple FillMissing pre-processors to get that behavior?

soco_loco · December 2, 2018, 10:59am

I see FillMissing specifies an add_col argument. If set to True, I’m assuming it will not add the column?

If set to False, then it doesn’t create a dummy value column (c_na: 0 or 1). The dummy value column c_na indicates whether a value in the reference column (c) was artificially filled in with a computed value.

The relevant section is here: "fill_strategy is adopted to replace those nans and if add_col is True, whenever a column c has missing values, a column named c_nan is added and flags the line where the value was missing."

https://docs.fast.ai/tabular.transform.html#FillMissing
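
As a rough illustration of what add_col switches on and off, here is a minimal pure-Python sketch. It is not fastai's implementation: fill_missing and the sample rows are hypothetical, the fill value is assumed to be the column median, and the flag column uses the _na suffix from the source excerpt quoted earlier in this thread. It also assumes each column has at least one non-missing value.

```python
from statistics import median


def fill_missing(rows, cont_names, add_col=True):
    """Fill None in continuous columns with the column median.

    Returns the names of the flag columns that were added (empty if add_col
    is False or nothing was missing).
    """
    added = []
    for name in cont_names:
        present = [r[name] for r in rows if r[name] is not None]
        if len(present) == len(rows):
            continue  # no missing values, so no flag column and no filling
        fill = median(present)
        for r in rows:
            was_missing = r[name] is None
            if add_col:
                r[name + "_na"] = was_missing  # flag recorded before filling
            if was_missing:
                r[name] = fill
        if add_col:
            added.append(name + "_na")
    return added


rows = [{"CloudCover": 4.0}, {"CloudCover": None}, {"CloudCover": 8.0}]
added = fill_missing(rows, ["CloudCover"], add_col=True)
print(added)    # ['CloudCover_na']
print(rows[1])  # {'CloudCover': 6.0, 'CloudCover_na': True}

rows_no_flag = [{"CloudCover": 4.0}, {"CloudCover": None}, {"CloudCover": 8.0}]
added_no_flag = fill_missing(rows_no_flag, ["CloudCover"], add_col=False)
print(added_no_flag)    # []
print(rows_no_flag[1])  # {'CloudCover': 6.0}
```

In both cases the missing value is filled; add_col only controls whether the information "this value was originally missing" survives as an extra column.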

ab_ai · October 4, 2019, 3:16am

What advantage does that add_col provide? I’m trying to understand if my model will be more or less tolerant if some data is missing from the data input. I’d like to build a model which tolerates certain data failing to be entered.

muellerzr · October 4, 2019, 3:18am

add_col will generate a binary categorical column based on whether a value was present or not, which can be used in our embeddings. Does this help?

ab_ai · October 4, 2019, 3:26am

That’s the part I understand. Right now I am building a model which streams data at 1Hz. It is possible that one feature of this data might return NaN values due to the sensor corresponding to that feature failing.

We are investigating strategies to help our model give good predictions in light of a sensor failure which would return NaN to one (or more) features.

muellerzr · October 4, 2019, 4:10am

Got it! Then yes, a missing-value feature is especially important if you have categorical data. For numerical data not so much (as the missing value just gets filled with the median by default). So you’d want to maybe manually insert some NaNs scattered around every variable so that the missing case has a place in the matrix; even though it may not be used, it is there when needed. Then, if you scatter enough of them, you could handle imbalanced samples specifically for missing values.

Does this help you get started thinking?
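
One way the suggestion above could be sketched: blank out a small random fraction of each feature in the training data so that the missing case is seen during training. The function inject_missing, the column names, and the 5% fraction are all hypothetical choices for illustration, not something prescribed in this thread or by fastai.

```python
import random


def inject_missing(rows, columns, frac=0.05, seed=0):
    """Return a copy of rows with about `frac` of each column set to None."""
    rng = random.Random(seed)          # seeded so the result is reproducible
    out = [dict(r) for r in rows]      # copy, so the original data is untouched
    # At least one missing value per column, but never more than there are rows.
    k = min(len(out), max(1, round(frac * len(out))))
    for col in columns:
        for i in rng.sample(range(len(out)), k):
            out[i][col] = None
    return out


# Hypothetical 1 Hz sensor readings: one continuous and one categorical feature.
readings = [{"temp": 20.0 + i, "sensor_state": "ok"} for i in range(100)]
augmented = inject_missing(readings, ["temp", "sensor_state"], frac=0.05)

n_temp_missing = sum(r["temp"] is None for r in augmented)
n_state_missing = sum(r["sensor_state"] is None for r in augmented)
print(n_temp_missing, n_state_missing)  # 5 5
```

Each column gets its own random sample of rows, so the simulated failures are independent across features; a real sensor outage would more likely produce consecutive missing readings, which may be worth simulating separately.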

ab_ai · October 4, 2019, 12:47pm

I think we are on the same page now.

When I read the documentation last night, I thought that the c_nan column and flag would be added only for a continuous column. The documentation says that missing values in the cat_names columns get the value 0 in the TabularDataBunch. This also looks like the strategy employed in the source code. There, for each of the cont_names that has missing values, it creates a new column with _na appended and fills that column with 0 or 1 depending on whether the value is missing. Then it fills the missing values according to the chosen FillStrategy.

Do you see it the same way?

What is the explanation or implication for embeddings? My memory of that part of the coursework is fuzzy. I’ll have to re-watch the embeddings discussion.