Kaggle Titanic and DecisionTreeRegressor

I’m trying to do the Kaggle Titanic competition using a DecisionTreeRegressor as in lesson 9.
I’m doing fine (I think, at least) until I want to use the model to get predictions on the test set.

I prepared the test set this way:
df_test = pd.read_csv(path/'test.csv', low_memory=False)
to_test = TabularPandas(df_test, procs, cat, cont)

When I finally try to predict with my trained DecisionTreeRegressor, I get the following error:
ValueError: Number of features of the model must match the input. Model n_features is 12 and input n_features is 13

I understand there is no Survived column in the test set, as that’s what I want my model to predict.
I’m also not sure I’m using TabularPandas the right way, as I’m calling to_test.train.xs even though it’s not a training set.
Can anyone please tell me how to solve it? Thanks in advance!
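For anyone hitting the same mismatch: the root cause is that processing the test set independently can produce a different set of columns than the training set. A toy sketch with plain pandas (this just mimics what a FillMissing-style step does; `fill_missing` is a made-up helper, not fastai code):

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the Titanic train/test sets (hypothetical data).
df_train = pd.DataFrame({"Age": [22.0, np.nan, 30.0], "Fare": [7.25, 8.05, 9.0]})
df_test = pd.DataFrame({"Age": [25.0, 40.0], "Fare": [10.0, np.nan]})

def fill_missing(df):
    """Add a '_na' indicator column only where *this* frame has NaNs."""
    out = df.copy()
    for col in df.columns:
        if out[col].isna().any():
            out[f"{col}_na"] = out[col].isna()
            out[col] = out[col].fillna(out[col].median())
    return out

# Applied independently, each frame grows a different set of indicator columns.
train_xs, test_xs = fill_missing(df_train), fill_missing(df_test)
print(list(train_xs.columns))  # ['Age', 'Fare', 'Age_na']
print(list(test_xs.columns))   # ['Age', 'Fare', 'Fare_na']
# Different column counts -> the "n_features" mismatch sklearn complains about.
```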

to_test should be built based off of the training set, which you’re not doing here. I currently have a PR in that should make this easier in the future, but for now you need to do something akin to either:

dls = to.dataloaders()
test_dl = dls.test_dl(df_test)


or:

to_test = to.valid.new(df_test)

The first is for the scenario where you don’t have access to the training data, i.e. after learn.export or torch.save(dls) and then loading them back in.

The second is for when it’s all in the same notebook/instance.
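The underlying idea in both cases is: fit the preprocessing on the training set once, then reuse that exact spec on the test set. A minimal plain-pandas sketch of that idea (`FillMissingSpec` is a made-up name for illustration, not a fastai class):

```python
import numpy as np
import pandas as pd

class FillMissingSpec:
    """Toy stand-in for a FillMissing-style proc: remember which columns
    had NaNs at fit time, then apply the same transform to any new frame."""
    def fit(self, df):
        self.na_cols = [c for c in df.columns if df[c].isna().any()]
        self.medians = {c: df[c].median() for c in self.na_cols}
        return self

    def transform(self, df):
        out = df.copy()
        for c in self.na_cols:  # the columns chosen at fit time, not df's own
            out[f"{c}_na"] = out[c].isna()
            out[c] = out[c].fillna(self.medians[c])
        return out

df_train = pd.DataFrame({"Age": [22.0, np.nan, 30.0], "Fare": [7.25, 8.05, 9.0]})
df_test = pd.DataFrame({"Age": [25.0, 40.0], "Fare": [10.0, 11.0]})

spec = FillMissingSpec().fit(df_train)
train_xs, test_xs = spec.transform(df_train), spec.transform(df_test)
# Both frames now have identical columns, so the model's n_features match.
print(list(train_xs.columns) == list(test_xs.columns))  # True
```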

Thanks for the reply @muellerzr

I tried the second approach above (to.valid.new).
When I then tried to predict, I got the following error:

KeyError: "['Age_na'] not in index"

Here is the major part of the code if it helps:

df = pd.read_csv(path/'train.csv', low_memory=False)

procs = [Categorify, FillMissing]

cont, cat = cont_cat_split(df, 1, dep_var=dep_var)

to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)

m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(to.train.xs, to.train.y)

df_test = pd.read_csv(path/'test.csv', low_memory=False)

to_test = to.valid.new(df_test)


KeyError: "['Age_na'] not in index"

But checking to.valid.items.columns (i.e. on the original TabularPandas object) shows that it does have an 'Age_na' column.

Any idea?

Thanks again
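If you just need to unblock yourself, one workaround (plain pandas, not a fastai API) is to align the processed test frame to the training columns, filling the absent `_na` indicators with False ("value was not missing"). Toy sketch with hypothetical frames:

```python
import pandas as pd

# Hypothetical processed feature frames: the training xs has an Age_na
# indicator column that the test xs lacks (the KeyError above).
train_xs = pd.DataFrame({"Age": [22.0, 30.0], "Age_na": [False, True]})
test_xs = pd.DataFrame({"Age": [25.0, 40.0]})

# Align the test frame to the training columns; missing indicator columns
# are filled with False, and the training column order is restored.
test_xs = test_xs.reindex(columns=train_xs.columns, fill_value=False)
print(list(test_xs.columns))  # ['Age', 'Age_na']
```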

I have the same issue with the Kaggle House Prices competition.
I’m trying to get around this by adding one fake row to the training data: it has np.nan in every column that contains at least one NaN anywhere in the combined training+test set, and the training-set mode in every other column.
df_all = pd.concat([df, df_test])  # df.append is deprecated in newer pandas

dep_var = 'SalePrice'

nas = df_all.isnull().sum()
nas[dep_var] = 0

df_mode = df.mode(axis=0, dropna=False).iloc[0]  # first mode per column

fake_nan_row = np.where(nas == 0, df_mode[nas.index], np.nan)

Tried adding the fake_nan_row to training data, but when doing
to_test = to.valid.new(df_test)
it shows KeyError: "['MasVnrArea_na', 'BsmtFinSF2_na', 'BsmtFinSF1_na', 'BsmtUnfSF_na', 'LotFrontage_na', 'GarageArea_na', 'GarageYrBlt_na', 'TotalBsmtSF_na'] not in index"
even if I add the fake_nan_row to the test dataset.
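For reference, the fake-NaN-row trick can be sketched end to end on toy data like this (hypothetical values; `pd.concat` instead of the deprecated `df.append`):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the House Prices frames (hypothetical values).
df = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0], "SalePrice": [200, 210, 190]})
df_test = pd.DataFrame({"LotFrontage": [np.nan, 70.0]})
dep_var = "SalePrice"

# Count NaNs over train+test so every column that is ever missing is known.
df_all = pd.concat([df, df_test])
nas = df_all.isnull().sum()
nas[dep_var] = 0  # never treat the target as missing

# One fake row: NaN wherever the column can be missing, training mode elsewhere.
df_mode = df.mode(axis=0, dropna=False).iloc[0]
fake_nan_row = pd.Series(np.where(nas == 0, df_mode[nas.index], np.nan),
                         index=nas.index)

df_with_fake = pd.concat([df, fake_nan_row.to_frame().T], ignore_index=True)
print(df_with_fake["LotFrontage"].isna().sum())  # 2: one real NaN + the fake row
```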


It also turns out that the columns dropped during the feature-importance step aren’t tracked anywhere by sklearn, so we have to drop the same columns again before predicting:

dls = to.dataloaders()
test_dl = dls.test_dl(df_test)
m.predict(test_dl.dataset.xs[to_keep].drop(to_drop, axis=1))

Adding just one row with NaNs to the training data works, though if I remove it, I get an error:
AssertionError: nan values in BsmtFinSF1 but not in setup training set

Also, if I run without the fake NaN row in training:

to_test = to.valid.new(df_test)
m.predict(to_test.xs[to_keep].drop(to_drop, axis=1))

I get an error:
KeyError: "['MasVnrArea_na', 'LotFrontage_na', 'GarageYrBlt_na'] not in index"
This one is not related to sklearn; it happens just from calling to_test.xs, which means that for some reason
to_test = to.valid.new(df_test)
doesn’t handle the autogenerated _na columns properly.

So, my guess is that I have to add a fake row with NaN values to my training data so that all of the possible columns are available for the random forest.
I don’t know if there is a more elegant way to do this automatically with the dataloaders.
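One possibly cleaner alternative (a hypothetical helper sketched in plain pandas, not a fastai or sklearn API): instead of polluting the training data with a fake row, align the processed test frame to the training features after the fact, which handles both the missing `_na` columns and NaNs the training run never saw:

```python
import numpy as np
import pandas as pd

def make_test_xs(test_xs, train_xs):
    """Align a processed test frame with the training features:
    - adds any missing *_na indicator columns as False
    - fills NaNs the training run never saw with the training median
    - restores the training column order"""
    out = test_xs.copy()
    for col in train_xs.columns:
        if col not in out.columns:
            out[col] = False  # unseen _na indicator: "not missing"
        elif out[col].isna().any():
            out[col] = out[col].fillna(train_xs[col].median())
    return out[list(train_xs.columns)]

# Hypothetical processed frames.
train_xs = pd.DataFrame({"BsmtFinSF1": [0.0, 500.0],
                         "LotFrontage": [65.0, 70.0],
                         "LotFrontage_na": [False, True]})
test_xs = pd.DataFrame({"BsmtFinSF1": [np.nan], "LotFrontage": [80.0]})

aligned = make_test_xs(test_xs, train_xs)
print(list(aligned.columns))  # ['BsmtFinSF1', 'LotFrontage', 'LotFrontage_na']
```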