I have the same issue with the House Prices prediction competition on Kaggle.
I'm trying to work around it by appending one fake row to the training data: it has np.nan in every column that contains at least one NaN anywhere in the combined training+test set, and the training-set mode in every other column.
df_all = pd.concat([df, df_test])  # df.append was removed in pandas 2.0; concat is equivalent here
display_all(df_all.isnull().sum().sort_index()/len(df_all))  # fraction of missing values per column
dep_var = 'SalePrice'
nas = df_all.isnull().sum()
nas[dep_var] = 0  # SalePrice is only missing because the test set lacks it
nas
df_mode = df.mode(axis=0, dropna=False).iloc[0]  # mode() can return several rows on ties; keep the first
df_mode
fake_nan_row = np.where(nas == 0, df_mode, np.nan)  # NaN wherever the column can ever be missing
pd.DataFrame([fake_nan_row], columns=nas.index)  # preview the single fake row with column labels
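To actually add the row, I do something like this (a minimal sketch; it assumes nas.index lines up with df's columns, which it does here since df_test brings in no new columns):

fake_df = pd.DataFrame([fake_nan_row], columns=nas.index)
df = pd.concat([df, fake_df], ignore_index=True)  # training data now has a NaN in every fillable column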
I tried adding fake_nan_row to the training data, but when running
to_test = to.valid.new(df_test)
m.predict(to_test.xs)
it shows KeyError: "['MasVnrArea_na', 'BsmtFinSF2_na', 'BsmtFinSF1_na', 'BsmtUnfSF_na', 'LotFrontage_na', 'GarageArea_na', 'GarageYrBlt_na', 'TotalBsmtSF_na'] not in index"
even if I also add the fake_nan_row to the test set.
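One thing worth checking (an assumption on my part, I haven't verified it): to.valid.new only builds the new TabularPandas, and the procs (Categorify/FillMissing/Normalize) are applied by process(), so the _na columns could be missing simply because the frame was never processed:

to_test = to.valid.new(df_test)
to_test.process()  # re-applies the procs that were set up on the training data
preds = m.predict(to_test.xs)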
Edit:
It turns out that sklearn doesn't remember which columns were dropped during the feature-importance step, so we have to apply the same column selection as in the earlier steps:
dls = to.dataloaders()
test_dl = dls.test_dl(df_test)  # applies the training procs to df_test
m.predict(test_dl.dataset.xs[to_keep].drop(to_drop, axis=1))  # same selection as in training
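To avoid repeating the selection by hand, it might be cleaner to capture the final training columns once and reuse them (a sketch; xs_final here is assumed to be the training feature frame after both drops, as in the course notebook):

final_cols = xs_final.columns
preds = m.predict(test_dl.dataset.xs[final_cols])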
Adding just one row with NaNs to the training data works; if I remove it, I get this error:
AssertionError: nan values in BsmtFinSF1 but not in setup training set
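That assertion appears to come from fastai's FillMissing, which refuses NaNs in columns that had none at setup time. A quick way to see which test columns would trigger it (my own sketch):

train_nan_cols = set(df.columns[df.isnull().any()])
test_nan_cols = set(df_test.columns[df_test.isnull().any()])
print(test_nan_cols - train_nan_cols)  # columns where FillMissing never saw NaNs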
Also, if I run this without the fake NaN row in the training data:
to_test = to.valid.new(df_test)
m.predict(to_test.xs[to_keep].drop(to_drop, axis=1))
I get an error:
KeyError: "['MasVnrArea_na', 'LotFrontage_na', 'GarageYrBlt_na'] not in index"
which is not related to sklearn; it happens just from accessing to_test.xs.
So for some reason,
to_test = to.valid.new(df_test)
doesn't generate the autogenerated _na columns properly, presumably because FillMissing only creates _na indicators for columns that contained NaNs when it was set up on the training data.
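To confirm that reading, the recorded columns can be inspected on the FillMissing proc (a sketch; I'm assuming fastai's Pipeline exposes the transform as to.procs.fill_missing and that it keeps a na_dict attribute, so treat both names as assumptions):

fm = to.procs.fill_missing  # fetch the FillMissing proc from the pipeline
print(fm.na_dict)           # the only columns that ever get a _na indicator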
So, my guess is that I have to add a fake row with NaN values to the training data so that all of the possible columns are available for the Random Forest.
I don't know if there is a more elegant way to have the dataloaders do this automatically.
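In case it helps, here is the whole workaround wrapped into a small helper (a sketch under the same assumptions as above; add_fake_nan_row is my own name, and it assumes df_train has a default RangeIndex):

import numpy as np
import pandas as pd

def add_fake_nan_row(df_train, df_test, dep_var):
    # Columns that contain a NaN anywhere in train+test get NaN in the fake row;
    # every other column gets the training-set mode.
    df_all = pd.concat([df_train, df_test])
    nas = df_all[df_train.columns].isnull().sum()
    nas[dep_var] = 0  # the target is only NaN because the test set lacks it
    modes = df_train.mode(dropna=False).iloc[0]
    fake = modes.where(nas == 0, np.nan)
    out = df_train.copy()
    out.loc[len(out)] = fake  # append one row, aligned on column names
    return out

df = add_fake_nan_row(df, df_test, dep_var)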