TabularPandas not able to replace missing values

riteshpaul · November 25, 2020, 9:38am

I am using the below code -
to = TabularPandas(train, procs, cat, cont, y_names=dep_var, splits=splits)

so when I do - to.show(3), I get -

and to see the underlying items - to.items.head(3) -

Shouldn’t the missing values (NaNs) have been replaced by the medians or something?

This causes an error when I run the regressor -

Can someone help on this error?

johannesstutz · November 25, 2020, 10:08am

First guess: did you include FillMissing in the procs?

If yes: I don’t know. It’s probably easier if you post the complete code or even better, a link to the notebook.

riteshpaul · November 25, 2020, 10:26am

procs = [Categorify, FillMissing]

cond = ((train['Year']<2011) & (train['Month']<11))
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]

splits = (list(train_idx),list(valid_idx))

cont,cat = cont_cat_split(train, 1, dep_var=dep_var)

to = TabularPandas(train, procs, cat, cont, y_names=dep_var, splits=splits)

johannesstutz · November 25, 2020, 10:58am

That’s strange. The NaN values should indeed be replaced by the median, or by #nan for categorical columns.
The MarkDown1_na etc. columns indicate that some transform has been applied, but I don’t know why it shows this behavior. Maybe someone else can help!

riteshpaul · November 25, 2020, 12:06pm

If it helps, I am running on Kaggle, where the fastai version is '2.0.19'.

muellerzr · November 25, 2020, 4:01pm

FillMissing works on continuous data, as the documentation states. Categorical variables can absolutely have missing values. They get a special spot in the embedding matrix.
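To make the behavior described above concrete, here is a minimal plain-pandas sketch of median filling with a `_na` indicator column. It is an illustration only, not fastai's actual implementation: the function name `fill_missing_median` and the toy data are made up for this example.

```python
import numpy as np
import pandas as pd

def fill_missing_median(df, cont_names):
    """Median-fill continuous columns and add a boolean <col>_na indicator.

    Hypothetical helper for illustration; not fastai's FillMissing code.
    """
    out = df.copy()
    for name in cont_names:
        is_na = out[name].isna()
        if is_na.any():
            # Remember which rows were missing before overwriting them.
            out[name + "_na"] = is_na
            out[name] = out[name].fillna(out[name].median())
    return out

# Toy data: only MarkDown1 has a missing value.
toy = pd.DataFrame({"MarkDown1": [1.0, np.nan, 3.0], "Size": [10.0, 20.0, 30.0]})
filled = fill_missing_median(toy, ["MarkDown1", "Size"])
print(filled)
```

Columns without missing values (here `Size`) are left alone and get no indicator column.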

riteshpaul · November 25, 2020, 6:19pm

OK, got it. But when I run the DecisionTreeRegressor, it throws an error because of missing values (screenshot). So how do I handle this using TabularPandas?

muellerzr · November 25, 2020, 6:32pm

You should investigate your cat and cont variable names to make sure that they are how you would expect them to be. Perhaps something that should be categorical is being thrown into the continuous list.

(Note that cont_cat_split is a helper function, but that doesn’t mean it works in every single scenario)
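One way to do that check is to list every column that contains NaNs together with the group it was assigned to. The sketch below uses plain pandas; `missing_report` and the toy DataFrame are hypothetical and stand in for the real `train`, `cont` and `cat` from the posts above.

```python
import numpy as np
import pandas as pd

def missing_report(df, cont_names, cat_names):
    """Return one row per column that has NaNs: its group, dtype and NaN count."""
    rows = []
    for group, names in (("cont", cont_names), ("cat", cat_names)):
        for name in names:
            n_missing = int(df[name].isna().sum())
            if n_missing:
                rows.append({"column": name, "group": group,
                             "dtype": str(df[name].dtype), "n_missing": n_missing})
    return pd.DataFrame(rows, columns=["column", "group", "dtype", "n_missing"])

# Toy stand-in for the real data (assumed column names).
toy = pd.DataFrame({
    "MarkDown1": [1.0, np.nan, np.nan],
    "Size": [10, 20, 30],
    "Type": ["A", None, "B"],
})
report = missing_report(toy, cont_names=["MarkDown1", "Size"], cat_names=["Type"])
print(report)
```

Any column that shows up under `cont` but looks like an ID or a label is a candidate for moving into `cat` by hand.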

riteshpaul · November 26, 2020, 6:08am

The only columns with missing values are MarkDown1-5, and when I list the cont and cat variables, the MarkDown1-5 columns appear in cont, so they are treated as continuous. Yet TabularPandas is unable to replace these missing values. Not sure what I am missing here. If it’s not too much trouble, you can check the notebook -
https://www.kaggle.com/ritepaul/rp-walmart-sales-forcast

muellerzr · November 26, 2020, 6:16am

There is another option at play here then. What % of those columns are NA? Is it 100%?
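A quick way to answer that question is to measure the NaN fraction on the training rows only, not on the whole DataFrame. The sketch below assumes that the fill value is computed from the training split (my understanding of how fastai sets up its procs, not something confirmed in this thread). If a column is entirely NaN in that split, its median is NaN too, and filling with it changes nothing. The helper `train_na_fraction` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def train_na_fraction(df, train_idx, cont_names):
    """Fraction of NaNs per continuous column, computed on training rows only."""
    return df.iloc[train_idx][cont_names].isna().mean()

# Toy data: MarkDown1 is only populated in the later year.
toy = pd.DataFrame({
    "Year": [2010, 2010, 2011, 2011],
    "MarkDown1": [np.nan, np.nan, 5.0, 7.0],
    "CPI": [210.0, np.nan, 212.0, 213.0],
})
train_idx = np.where(toy["Year"] < 2011)[0]

frac = train_na_fraction(toy, train_idx, ["MarkDown1", "CPI"])
print(frac)

# The median of an all-NaN training column is NaN, so filling with it is a no-op.
train_median = toy.iloc[train_idx]["MarkDown1"].median()
filled = toy["MarkDown1"].fillna(train_median)
print(train_median, filled.isna().sum())
```

A column can be well below 100% NaN overall and still be 100% NaN inside the training rows, so it is worth checking both numbers.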

riteshpaul · November 26, 2020, 6:47am

Here is the cont variable -
['Store',
'Dept',
'Temperature',
'Fuel_Price',
'MarkDown1',
'MarkDown2',
'MarkDown3',
'MarkDown4',
'MarkDown5',
'CPI',
'Unemployment',
'Size',
'Year',
'Month',
'Day',
'Dayofyear']

Of these, MarkDown1-5 have NaNs (not all values though).
So out of 16 continuous variables in total, 5 have NaNs - 31%

And here is the % of NaNs in each of these MarkDown columns -
MarkDown1 - 64.2%
MarkDown2 - 73.6%
MarkDown3 - 67.4%
MarkDown4 - 67.9%
MarkDown5 - 64%