I am trying to use the entity embedding architecture for the bull dozers dataset (https://www.kaggle.com/c/bluebook-for-bulldozers/data).
When I run the following:
procs = [FillMissing, Categorify, Normalize]
data = (TabularList.from_df(df, path=PATH, cat_names=cat_flds,
                            cont_names=cont_flds, procs=procs)
        .split_by_idx(val_idx)
        .label_from_df(cols=dep, label_cls=FloatList, log=False)
        .databunch())
I get the error:
AssertionError: You have NaN values in column(s) SalePrice of your dataframe, please fix it.
I thought FillMissing would take care of NaNs. Is that not so?
What do you have declared as cat, cont, and your dependent variable? FillMissing only imputes your input columns, not the label. If SalePrice is your dependent variable, you need to clean those rows out and remove the affected data points, as it makes no sense to train on rows with no y values.
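A minimal sketch of that cleanup with pandas, using a made-up toy dataframe in place of the real bulldozers data (column names here are just for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bulldozers dataframe; SalePrice (the label) has NaNs.
df = pd.DataFrame({
    "YearMade": [2001, 1998, 2005, 2010],
    "SalePrice": [66000.0, np.nan, 21500.0, np.nan],
})

# Drop rows where the dependent variable itself is missing,
# before handing the dataframe to TabularList / the databunch.
df = df.dropna(subset=["SalePrice"]).reset_index(drop=True)

print(df["SalePrice"].isna().sum())  # → 0
```

After this, FillMissing can still handle NaNs in the continuous *input* columns as usual.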
(We also can’t do much without seeing the complete code, this is all speculation)
In the notebook here (https://github.com/fastai/fastai/blob/master/courses/ml1/bulldozer_dl.ipynb), we can see that the number of features used to train the model is smaller than the actual number in the dataset. I am not able to find any notebook that shows how some of the features were chosen to be excluded. Can someone please help?
In Jeremy’s intro to ML course he discusses feature importance. It’s been covered in the forums too, and I’ll be covering it in a few weeks in my study group. Look at “permutation importance”
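The idea behind permutation importance is simple enough to sketch by hand: score the model once, then shuffle one feature column at a time and see how much the score degrades. Here is a toy version with a made-up dataset and a fake "fitted model" (in practice you would use your trained random forest's predictions instead):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data: y depends only on the first feature; the second is pure noise.
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0]

def predict(X):
    # Stand-in for a trained model's predict method.
    return 3 * X[:, 0]

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

baseline = mse(y, predict(X))

importances = []
for col in range(X.shape[1]):
    Xp = X.copy()
    rng.shuffle(Xp[:, col])  # break this column's relationship to y
    importances.append(mse(y, predict(Xp)) - baseline)

# Shuffling the informative feature hurts the score a lot;
# shuffling the noise feature changes nothing.
print(importances)
```

Features whose shuffled score barely moves are good candidates to drop, which is roughly how the reduced feature set in the course notebook was arrived at.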