Problem with cont, cat variables in House Prices Kaggle competition

I am participating in the House Prices - Advanced Regression Techniques competition. Following the instructions in Chapter 9 of the book, I created a TabularPandas object like this:

cont, cat = cont_cat_split(df, 1, dep_var=dep_var)
procs = [Categorify, FillMissing, Normalize]

to = TabularPandas(df, procs, cat, cont, y_names=dep_var, y_block=RegressionBlock(), splits=splits)

Then I created a DecisionTreeRegressor with a minimum of 25 samples per leaf:

xs, y = to.train.xs, to.train.y
m = DecisionTreeRegressor(min_samples_leaf=25).fit(xs, y)

But when I try to predict on the test set by calling:

to_test=TabularPandas(tst_df, procs=procs, cat_names=cat, cont_names=cont)
tst_xs = to_test.train.xs

def subm(preds, suff):
    tst_df['SalePrice'] = preds
    sub_df = tst_df[['Id','SalePrice']]
    sub_df.to_csv(f'sub-{suff}.csv', index=False)
subm(m.predict(tst_xs), 'dt')

I got an error:

Feature names unseen at fit time:
- BsmtFinSF1_na
- BsmtFinSF2_na
- BsmtFullBath_na
- BsmtHalfBath_na
- BsmtUnfSF_na
- ...
Feature names must be in the same order as they were in fit.

X has 91 features, but DecisionTreeRegressor is expecting 83 features as input.

My notebook is visible to anyone; can you help me with this?

I haven’t used DecisionTreeRegressor yet, so I can only offer general advice. The first piece is: you’ve not presented any investigation into your data, so I presume you’ve not done any. It’s hard to understand or debug something if you can’t “see” how the data flows and changes through your program, so I’ll run you through that.

Running your notebook, I get a slightly different error i.e. 90 instead of 91.

Then, examining each variable, I found a hint: some numbers matching those in my error…
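A quick way to surface that hint is to diff the column sets of the two feature frames directly. A minimal sketch, using toy stand-in frames in place of the notebook’s real xs = to.train.xs and tst_xs = to_test.train.xs (the column names here are illustrative):

```python
import pandas as pd

# Toy stand-ins for the notebook's training and test feature frames.
xs = pd.DataFrame({'LotArea': [8450, 9600], 'OverallQual': [7, 6]})
tst_xs = pd.DataFrame({'LotArea': [11622], 'OverallQual': [5], 'BsmtFinSF1_na': [True]})

extra = set(tst_xs.columns) - set(xs.columns)    # columns unseen at fit time
missing = set(xs.columns) - set(tst_xs.columns)  # columns the model expects but the test set lacks
print(len(xs.columns), len(tst_xs.columns), sorted(extra), sorted(missing))
```

Running the same diff on the real frames should print the offending “_na” names from the error message.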

The code setting those variables is far apart and hard to compare, so rearranging it makes it easier to focus on the comparison…

With it now obvious that cells 21 and 22 were very similar, I struggled to understand how they could end up with different numbers of columns. So, digging deeper…

First, the cat variable being modified by the TabularPandas initializer surprised me (i.e. it violates the Principle of Least Surprise). I would expect the input list to be copied into the object rather than stored by reference, where mutating it also modifies the caller’s list; but that might just be my unfamiliarity with Python conventions, so I’ll leave that as a side-bar.
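The mechanic itself is plain Python rather than anything fastai-specific. A minimal sketch of how storing a list by reference leaks internal mutations back to the caller (the Holder class and the appended column name are made up for illustration):

```python
# If a class stores a passed-in list by reference, mutating the attribute
# mutates the caller's list too.
class Holder:
    def __init__(self, names):
        self.names = names             # stored by reference, not copied
        self.names.append('extra_na')  # internal bookkeeping leaks out

cat = ['MSZoning', 'Street']
Holder(cat)
print(cat)  # ['MSZoning', 'Street', 'extra_na'] -- the outer list changed
```

The defensive fix on the caller’s side is to pass a copy, e.g. Holder(list(cat)).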

The next interesting thing is that the extra columns follow a distinctive pattern, ending in “_na”. Browsing the TabularPandas source code on GitHub, I found…

From this I infer that the FillMissing proc is the culprit: the training and test datasets have missing values in different places, so they end up with different additional columns.
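To see why processing the two datasets separately produces different columns, here is a rough pandas-only sketch of the FillMissing behaviour (the helper name and toy frames are my own, not fastai’s actual implementation):

```python
import pandas as pd

def add_na_indicators(df):
    """Rough sketch of what FillMissing does: for each column containing
    missing values, add a <col>_na indicator column and fill the gaps."""
    out = df.copy()
    for col in df.columns[df.isna().any()]:
        out[f'{col}_na'] = df[col].isna()
        out[col] = df[col].fillna(df[col].median())
    return out

# Missing values sit in different columns in each set...
train = pd.DataFrame({'LotFrontage': [65.0, None], 'BsmtFinSF1': [706.0, 978.0]})
test  = pd.DataFrame({'LotFrontage': [80.0, 81.0], 'BsmtFinSF1': [None, 468.0]})

# ...so processing each set on its own yields different _na columns.
print(add_na_indicators(train).columns.tolist())  # ends with 'LotFrontage_na'
print(add_na_indicators(test).columns.tolist())   # ends with 'BsmtFinSF1_na'
```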

I imagine a few ways to address this:

  1. Don’t use FillMissing, but then you lose the value of that additional information.

  2. Reduce your test data columns to match the columns used by the training data.
    Here “83” is the number of input features expected by my DecisionTreeRegressor. But the values missing from the test set may adversely affect your predictions.

  3. Expand the columns used to be a superset of both the training and test data. Something like this…
    a. Add a test column to both df & tst_df to identify their rows, i.e. using false & true respectively

    b. Concatenate df & tst_df

    c. Use TabularPandas to fill in missing values with additional columns

    d. Extract separate training and test datasets by filtering on the test column.

    I don’t know whether that will provide any additional value, since the model won’t have learnt any correlations involving missing-value patterns it hasn’t seen. Perhaps you need to train a model to predict and fill in the missing values in the test data, although that might violate the principle of keeping the datasets separate.
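Steps a–d above can be sketched in plain pandas (in the real notebook, step c would be done via TabularPandas with FillMissing; the frames and the filling loop here are illustrative only):

```python
import pandas as pd

# Toy stand-ins for the notebook's df (training) and tst_df (test).
df     = pd.DataFrame({'LotFrontage': [65.0, None], 'SalePrice': [208500.0, 181500.0]})
tst_df = pd.DataFrame({'LotFrontage': [80.0, None], 'SalePrice': [None, None]})

# a. flag the rows so they can be separated again later
df['test'], tst_df['test'] = False, True

# b. concatenate so missing-value handling sees both sets at once
both = pd.concat([df, tst_df], ignore_index=True)

# c. fill missing values, adding the same _na columns for every row
for col in ['LotFrontage']:
    both[f'{col}_na'] = both[col].isna()
    both[col] = both[col].fillna(both[col].median())

# d. split back on the flag
train_part = both[~both['test']].drop(columns='test')
test_part  = both[both['test']].drop(columns='test')
print(train_part.columns.equals(test_part.columns))  # True -- identical schema
```

Because both sets pass through the same filling step, the training and test frames come out with identical columns, which is exactly what the fitted DecisionTreeRegressor requires.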

btw, something related… Imputing missing values before building an estimator — scikit-learn 1.1.2 documentation

hope this helps.


Thank you. It helped
