proc_df() is a function from fastai/structured.py. It can:
- replace categories with their numeric codes
- handle missing continuous values
- split the dependent variable into a separate variable
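To see why the subsample matters, here is a minimal sketch of the missing-value handling (a hypothetical helper in the spirit of proc_df, not the actual fastai implementation): a boolean `<col>_na` indicator column is only created when the column actually contains missing values, so a subsample with no NaNs produces fewer columns.

```python
import pandas as pd
import numpy as np

def fill_missing(df, col):
    """Sketch of proc_df-style missing-value handling: add a boolean
    <col>_na indicator and fill NaNs with the median, but only if the
    column actually has missing values."""
    if df[col].isnull().sum() > 0:
        df[col + '_na'] = df[col].isnull()
        df[col] = df[col].fillna(df[col].median())
    return df

# Toy frame: the missing value sits in the later rows only.
df = pd.DataFrame({'auctioneerID': [1.0, 2.0, 3.0, np.nan, 5.0]})

# On the full data, the indicator column is created...
full = fill_missing(df.copy(), 'auctioneerID')
print(full.columns.tolist())  # ['auctioneerID', 'auctioneerID_na']

# ...but on a subsample with no NaNs it is not, hence the column mismatch.
sub = fill_missing(df.iloc[:3].copy(), 'auctioneerID')
print(sub.columns.tolist())   # ['auctioneerID']
```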
However, the way it handles missing continuous values depends on the data it is given. For example, if we set the subsample size to 20, proc_df transforms df_raw into df_trn based only on those 20 rows:
df_trn, y_trn = proc_df(df_raw, 'SalePrice', subset=20)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)
When we run the random forest and compare the RMSE scores,
m = RandomForestRegressor(n_jobs=-1)
%time m.fit(X_train, y_train)
print_score(m)
one will probably find the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-101-5a299083f051> in <module>()
1 m = RandomForestRegressor(n_jobs=-1)
2 get_ipython().magic('time m.fit(X_train, y_train)')
----> 3 print_score(m)
<ipython-input-28-284f16c2cedf> in print_score(m)
2
3 def print_score(m):
----> 4 res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
5 m.score(X_train, y_train), m.score(X_valid, y_valid)]
6 if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/sklearn/ensemble/forest.py in predict(self, X)
679 check_is_fitted(self, 'estimators_')
680 # Check data
--> 681 X = self._validate_X_predict(X)
682
683 # Assign chunk of trees to jobs
~/anaconda3/envs/fastai/lib/python3.6/site-packages/sklearn/ensemble/forest.py in _validate_X_predict(self, X)
355 "call `fit` before exploiting the model.")
356
--> 357 return self.estimators_[0]._validate_X_predict(X, check_input=True)
358
359 @property
~/anaconda3/envs/fastai/lib/python3.6/site-packages/sklearn/tree/tree.py in _validate_X_predict(self, X, check_input)
382 "match the input. Model n_features is %s and "
383 "input n_features is %s "
--> 384 % (self.n_features_, n_features))
385
386 return X
ValueError: Number of features of the model must match the input. Model n_features is 65 and input n_features is 66
This is because the number of columns in df_trn and df (where df, y = proc_df(df_raw, 'SalePrice'), i.e. the dataframe transformed using all the training data) does not match:
print(df_trn.shape, df.shape)
(20, 65) (401125, 66)
What I did was find out which column(s) df_trn is missing,
print([n for n in df.columns if n not in df_trn.columns])
['auctioneerID_na']
And set the appropriate values accordingly:
df_trn['auctioneerID_na'] = False
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)
m = RandomForestRegressor(n_jobs=-1)
%time m.fit(X_train, y_train)
print_score(m)
[0.34220521625855504, 0.6633520271867647, 0.81138099286902421, 0.21415644568550077]
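In case it helps, one possibly more general way to align the frames (an untested sketch, not from the lesson; the frames below are toy stand-ins for df and df_trn) is pandas reindex, which adds any columns present in the full frame but absent from the subsample and fills them in one step:

```python
import pandas as pd

# Hypothetical stand-ins for df (full data) and df_trn (subsample):
df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0], 'b_na': [False, True]})
df_trn = pd.DataFrame({'a': [1], 'b': [3.0]})

# Add missing columns (here the _na indicators), filled with False,
# and put the columns in the same order as the full frame.
df_trn = df_trn.reindex(columns=df.columns, fill_value=False)
print(df_trn.columns.tolist())  # ['a', 'b', 'b_na']
```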
This issue is not particularly important for the bulldozer data, since any subsample size greater than 100 seems to work fine, but it might become annoying for datasets from other Kaggle competitions.
@jeremy please let us know if you have any more generic ways to deal with this issue.