Perform proc_df on a small subsample

proc_df() is a function from fastai/ that can:

  • replace categories with their numeric codes
  • handle missing continuous values
  • split the dependent variable into a separate variable

However, the way it handles missing continuous values depends on the data passed to the function. For example, if we set the subsample size to 20, proc_df transforms df_raw into df_trn based only on the 20 rows it sampled:

df_trn, y_trn = proc_df(df_raw, 'SalePrice', subset=20)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)
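
To illustrate why the column count can depend on the sample, here is a toy sketch. It is not the fastai code: `fill_missing_with_flags` is a hypothetical stand-in that assumes the missing-value step adds a `<col>_na` indicator only for numeric columns that actually contain a missing value, and the `MachineHours` data is made up.

```python
import numpy as np
import pandas as pd

def fill_missing_with_flags(df):
    """Hypothetical stand-in for the missing-value step (assumed behaviour, not fastai's code)."""
    out = df.copy()
    for col in df.columns:
        # Only columns that contain at least one NaN get an indicator column.
        if pd.api.types.is_numeric_dtype(df[col]) and df[col].isnull().any():
            out[col + '_na'] = df[col].isnull()
            out[col] = df[col].fillna(df[col].median())
    return out

# Made-up data: MachineHours has missing values in rows 1 and 4.
full = pd.DataFrame({
    'YearMade': [1995, 2001, 1988, 2004, 1999, 2010],
    'MachineHours': [1200.0, np.nan, 800.0, 950.0, np.nan, 400.0],
})
small = full.iloc[[0, 2, 3]]  # a subsample that happens to contain no NaN

full_out = fill_missing_with_flags(full)
small_out = fill_missing_with_flags(small)
print(full_out.shape, small_out.shape)
```

The subsample ends up with one column fewer than the full data, simply because none of its rows happened to be missing a value.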

When fitting the random forest and comparing the RMSE scores,

m = RandomForestRegressor(n_jobs=-1)
%time m.fit(X_train, y_train)
print_score(m)

one will probably hit the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-101-5a299083f051> in <module>()
      1 m = RandomForestRegressor(n_jobs=-1)
      2 get_ipython().magic('time m.fit(X_train, y_train)')
----> 3 print_score(m)

<ipython-input-28-284f16c2cedf> in print_score(m)
      3 def print_score(m):
----> 4     res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
      5                 m.score(X_train, y_train), m.score(X_valid, y_valid)]
      6     if hasattr(m, 'oob_score_'): res.append(m.oob_score_)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/sklearn/ensemble/ in predict(self, X)
    679         check_is_fitted(self, 'estimators_')
    680         # Check data
--> 681         X = self._validate_X_predict(X)
    683         # Assign chunk of trees to jobs

~/anaconda3/envs/fastai/lib/python3.6/site-packages/sklearn/ensemble/ in _validate_X_predict(self, X)
    355                                  "call `fit` before exploiting the model.")
--> 357         return self.estimators_[0]._validate_X_predict(X, check_input=True)
    359     @property

~/anaconda3/envs/fastai/lib/python3.6/site-packages/sklearn/tree/ in _validate_X_predict(self, X, check_input)
    382                              "match the input. Model n_features is %s and "
    383                              "input n_features is %s "
--> 384                              % (self.n_features_, n_features))
    386         return X

ValueError: Number of features of the model must match the input. Model n_features is 65 and input n_features is 66 

This is because the number of columns in df_trn and df do not match (where df, y = proc_df(df_raw, 'SalePrice') is the dataframe transformed using all of the training data):

print(df_trn.shape, df.shape)

(20, 65) (401125, 66)

What I did was to find out which variable(s) df_trn is missing,

print([n for n in df.columns if n not in df_trn.columns])


And set the appropriate values accordingly:

X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)
m = RandomForestRegressor(n_jobs=-1)
%time m.fit(X_train, y_train)
print_score(m)

[0.34220521625855504, 0.6633520271867647, 0.81138099286902421, 0.21415644568550077]
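
For anyone wanting to do the same, a minimal sketch of one way to add the missing columns is below. It assumes the missing columns are `_na` indicator flags, so filling them with False is appropriate (no value was missing in the subsample); `align_columns` is a hypothetical helper and the small frames are made-up stand-ins for df and df_trn.

```python
import pandas as pd

def align_columns(df_small, df_full):
    """Give df_small exactly the columns of df_full, in the same order.

    Columns absent from df_small are created and filled with False,
    which assumes they are boolean '_na' indicator columns.
    """
    return df_small.reindex(columns=df_full.columns, fill_value=False)

# Made-up stand-ins for df (all training data) and df_trn (small subsample).
df = pd.DataFrame({
    'YearMade': [1995, 2001, 1988],
    'hours': [1200.0, 875.0, 800.0],
    'hours_na': [False, True, False],
})
df_trn = pd.DataFrame({'YearMade': [1995, 1988], 'hours': [1200.0, 800.0]})

missing = [c for c in df.columns if c not in df_trn.columns]
df_trn_aligned = align_columns(df_trn, df)
print(missing, df_trn_aligned.shape)
```

Reordering to match df matters as well, since the model checks only the number of features and relies on column position.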

This issue is not particularly important for the bulldozer data, as any subsample size greater than 100 seems to work fine, but it might become annoying for datasets from other Kaggle competitions.

@jeremy please let us know if you have any other generic ways to deal with this issue.


I don’t currently have a way, but I agree it needs to be fixed! Very happy to hear any thoughts for a good API for this - which also needs to resolve the issue that the median values should be fixed across different datasets too…
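
One possible shape for such an API, purely as a sketch: learn the medians once on the training data, then reuse them on every other dataset so that both the fill values and the set of `_na` columns stay fixed. The names `fit_medians` and `apply_medians` are hypothetical and the data is made up; this is not how the library does it.

```python
import numpy as np
import pandas as pd

def fit_medians(df):
    """Record the median of every numeric column that has missing values."""
    return {c: df[c].median() for c in df.columns
            if pd.api.types.is_numeric_dtype(df[c]) and df[c].isnull().any()}

def apply_medians(df, medians):
    """Fill with the stored medians and always add the same '_na' columns."""
    out = df.copy()
    for col, med in medians.items():
        out[col + '_na'] = out[col].isnull()
        out[col] = out[col].fillna(med)
    return out

# Made-up data: only the training set has a missing value.
train = pd.DataFrame({'hours': [100.0, np.nan, 300.0]})
valid = pd.DataFrame({'hours': [50.0, 70.0]})

medians = fit_medians(train)            # learned once, on the training data
train_out = apply_medians(train, medians)
valid_out = apply_medians(valid, medians)
print(list(train_out.columns), list(valid_out.columns))
```

Because the dictionary is built from the training data only, the validation frame gets the same columns even though it has no missing values itself.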

OK this is all fixed now (I think!) Do a git pull and check out the ‘Speeding Things Up’ section of lesson 1 to see it in action.


Thank you Jeremy, the notebook seems to work as expected. Please also update the ‘subsampling’ section, where the code needs to be changed as follows:

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')
X_train, X_valid = split_vals(df_trn, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)

otherwise there will be an error like this:

ValueError                                Traceback (most recent call last)
<ipython-input-74-8436ff698903> in <module>()
----> 1 df_trn, y_trn= proc_df(df_raw, 'SalePrice')
      2 X_train, X_valid = split_vals(df_trn, n_trn)
      3 y_train, y_valid = split_vals(y_trn, n_trn)

ValueError: too many values to unpack (expected 2)

Thank you for fixing this!
