Kaggle Titanic and DecisionTreeRegressor

I’m trying to do the Kaggle Titanic competition with a DecisionTreeRegressor, as in lesson 9.
Everything goes fine (I think, at least) until I want to use the model to get predictions on the test set.

I prepared the test set this way:
df_test = pd.read_csv(path/'test.csv', low_memory=False)
to_test = TabularPandas(df_test, procs, cat, cont)  # same procs/cat/cont as used for training

I finally try to predict using my trained DecisionTreeRegressor model:
m.predict(to_test.train.xs)

I get the following error:
ValueError: Number of features of the model must match the input. Model n_features is 12 and input n_features is 13

I understand there is no Survived column in the test set, since that’s what I want my model to predict.
I’m also not sure whether I’m using TabularPandas the right way, as I’m calling to_test.train.xs even though it isn’t a training set.
Can anyone please tell me how to solve it? Thanks in advance!

to_test should be built based on the training set, which you’re not doing here. I currently have a PR in that should make this easier in the future, but for now you need to do something akin to either:

dls = to.dataloaders()
test_dl = dls.test_dl(df_test)       # applies the procs fitted on the training data
m.predict(test_dl.dataset.xs)

Or:

to_test = to.valid.new(df_test)      # re-applies the fitted procs to the raw test frame
m.predict(to_test.xs)

The first is for the scenario where you don’t have access to the training data, i.e. after learn.export or torch.save(dls) and then loading them back in.

The second is for when everything is in the same notebook/instance.
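For the first scenario, the save/reload step might look roughly like this (a minimal sketch, assuming the DataLoaders were pickled with torch.save during training):

import torch

# training session: persist the DataLoaders, and with them the fitted procs
torch.save(dls, 'dls.pkl')

# later inference session, without access to the training data
# (the fitted model m would be persisted separately, e.g. with pickle)
dls = torch.load('dls.pkl')
test_dl = dls.test_dl(df_test)        # raw test frame goes through the saved procs
preds = m.predict(test_dl.dataset.xs)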

Thanks for the reply @muellerzr

I tried the second approach above.
When I try to predict, I get the following error:

KeyError: "['Age_na'] not in index"

Here is the relevant part of the code, in case it helps:

df = pd.read_csv(path/'train.csv', low_memory=False)

procs = [Categorify, FillMissing]
dep_var = 'Survived'

cont, cat = cont_cat_split(df, 1, dep_var=dep_var)

to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)  # splits defined earlier (not shown)

m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(to.train.xs, to.train.y)

df_test = pd.read_csv(path/'test.csv', low_memory=False)

to_test = to.valid.new(df_test)

m.predict(to_test.xs)

KeyError: "['Age_na'] not in index"

But checking to.valid.items.columns (i.e. on the original TabularPandas object) tells me that it does have an Age_na column.
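A quick way to see which expected columns are missing (using to_test.items here, since to_test.xs itself is what raises the error):

# columns the model was trained on that the test transform did not produce
print(set(to.train.xs.columns) - set(to_test.items.columns))   # presumably {'Age_na'}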

Any idea?

Thanks again

I have the same issue with the House Prices Kaggle competition.
I’m trying to get around it by adding one fake row to the training data: it has np.NaN in every column that contains at least one NaN anywhere in the combined training+test set, and the training-set mode everywhere else.
df_all = df.append(df_test)   # combine train and test to find every column that can be missing
display_all(df_all.isnull().sum().sort_index()/len(df_all))

dep_var = 'SalePrice'

nas = df_all.isnull().sum()
nas[dep_var] = 0   # the target only looks missing because the test set lacks it
nas

df_mode = df.mode(axis=0, dropna=False)
df_mode

# NaN wherever a column can ever be missing, the training mode everywhere else
fake_nan_row = np.where(nas == 0, df_mode, np.NaN)
pd.DataFrame(fake_nan_row)
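Then I append that row to the training data before building the TabularPandas, roughly like this (assuming df.mode came back with a single row):

df = df.append(pd.DataFrame(fake_nan_row, columns=df_mode.columns), ignore_index=True)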

But with the fake_nan_row added to the training data, when doing

to_test = to.valid.new(df_test)
m.predict(to_test.xs)

it still shows

KeyError: "['MasVnrArea_na', 'BsmtFinSF2_na', 'BsmtFinSF1_na', 'BsmtUnfSF_na', 'LotFrontage_na', 'GarageArea_na', 'GarageYrBlt_na', 'TotalBsmtSF_na'] not in index"

even if I also add the fake_nan_row to the test dataset.

Edit:

It turns out that sklearn doesn’t remember which columns were dropped during the feature-importance step, so we have to drop the same columns again ourselves (to_keep and to_drop come from that earlier step; see the sketch after the code):

dls = to.dataloaders()
test_dl = dls.test_dl(df_test)
m.predict(test_dl.dataset.xs[to_keep].drop(to_drop, axis=1))   # same column selection as in training
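Here to_keep and to_drop are the column lists from the earlier feature-importance steps, something like (a sketch; fi is assumed to be the importance frame from rf_feat_importance, as in lesson 9):

to_keep = fi[fi.imp > 0.005].cols   # keep features above the importance threshold
# to_drop is the list of redundant columns identified earlier (not reproduced here)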

Adding just one row with NaNs to the training data works; if I remove it, I get an error:
AssertionError: nan values in BsmtFinSF1 but not in setup training set
That makes sense if FillMissing only records fill values (and creates the _na columns) for columns that actually contained NaNs when it was set up.

Also, if I run without the fake NaN row in training:

to_test = to.valid.new(df_test)
m.predict(to_test.xs[to_keep].drop(to_drop, axis=1))

I get an error:
KeyError: "['MasVnrArea_na', 'LotFrontage_na', 'GarageYrBlt_na'] not in index"
This one is not related to sklearn; it happens just by calling to_test.xs.
So for some reason,
to_test = to.valid.new(df_test)
doesn’t work properly with the autogenerated _na columns.

So my guess is that I have to add a fake row with NaN values to my training data so that all of the possible columns are available for the Random Forest.
I don’t know if there is a more elegant way to have the dataloaders do this automatically.
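For reference, the whole workaround condenses to something like this (a sketch; the helper name is mine):

def add_fake_nan_row(df_train, df_test, dep_var):
    # Append one row to df_train that is NaN in every column that is ever
    # missing across train+test, and the training-set mode everywhere else.
    df_all = df_train.append(df_test)
    nas = df_all.isnull().sum()
    nas[dep_var] = 0                            # the target only looks missing because df_test lacks it
    mode = df_train.mode(dropna=False).iloc[0]  # one row of training-set modes
    fake = np.where(nas.loc[mode.index] == 0, mode, np.NaN)
    return df_train.append(pd.DataFrame([fake], columns=mode.index), ignore_index=True)

df = add_fake_nan_row(df, df_test, dep_var)     # FillMissing now sees a NaN in every such column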