I’m applying what I learned about random forests to a Kaggle competition. Everything is going smoothly until I try to predict the values on my test set:
```python
# load the training data, set up the categories, and process the data
df_raw = pd.read_csv(f'{PATH}train.csv', low_memory=False)
train_cats(df_raw)
df, y, nas = proc_df(df_raw, 'SalePrice')
# train the model
m = RandomForestRegressor(...)
m.fit(df, y)
# load the test data
df_test = pd.read_csv(f'{PATH}test.csv', low_memory=False)
# apply the category changes
apply_cats(df_test, df_raw)
```
At this point, I want to run the model’s predict function on my test data.
However, if I pass in df_test now, it gives me an error because I haven’t replaced my category values with integers yet. To do this, I need to call proc_df, but it requires a field name for the dependent variable (i.e. 'SalePrice'), and the test data doesn’t have that column. So I add a placeholder column with that name and run proc_df on df_test.
However, proc_df has now added a bunch of new columns to my test dataframe to mark whether the values of various fields are NaN, e.g. BsmtFinSF1_na (BsmtFinSF1 is a float). These columns were not added to the training data set, so the model complains about a mismatch in the number of columns.
Does anyone know what I am doing wrong? Am I approaching this correctly, or is there a better way to prepare the test dataframe for predictions?
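One way to think about the mismatch: the fill values and the list of `_na` flag columns should be learned from the training data once and then reused on the test data. Below is a minimal sketch of that idea in plain pandas. It is not what proc_df does internally; `fit_prep`, `apply_prep`, and the tiny frames are made up for illustration, and it only handles numeric columns.

```python
import numpy as np
import pandas as pd

def fit_prep(train, y_col):
    """Learn per-column medians and which columns get an _na flag (training data only)."""
    X = train.drop(columns=[y_col])
    num_cols = X.select_dtypes(include="number").columns
    medians = X[num_cols].median().to_dict()
    flagged = [c for c in num_cols if X[c].isna().any()]
    return medians, flagged

def apply_prep(X, medians, flagged):
    """Add the same _na flags and fill with the same medians, for train or test."""
    X = X.copy()
    for c in flagged:
        X[c + "_na"] = X[c].isna()
    return X.fillna(value=medians)

# Illustrative data: LotArea has a gap in training, BsmtFinSF1 only in test.
train = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0],
    "BsmtFinSF1": [706.0, 978.0, 486.0, 216.0],
    "SalePrice": [208500.0, 181500.0, 223500.0, 140000.0],
})
test = pd.DataFrame({
    "LotArea": [11622.0, 14267.0],
    "BsmtFinSF1": [np.nan, 923.0],
})

medians, flagged = fit_prep(train, "SalePrice")
X_train = apply_prep(train.drop(columns=["SalePrice"]), medians, flagged)
X_test = apply_prep(test, medians, flagged)
print(list(X_train.columns) == list(X_test.columns))
```

Because the flags come from the training data, the test frame ends up with the same columns even though BsmtFinSF1 is only missing in the test set. If your version of proc_df accepts an `na_dict` argument, passing in the `nas` returned from the training call may achieve the same thing, but I haven't checked that against your install.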
Question about Lesson 2, OOB score. If we randomly sample the data and use the left-over data as a validation set, don’t we run into temporal issues? I.e., when the data is time-series data, the validation set needs to be from a later time period.
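Actually, the validation set is the same for all the models Jeremy built, and it’s separated from the dataset well before any model is trained…
The OOB score works differently: each tree is fit on a bootstrap sample that covers roughly 2/3 of the training rows, and it is scored on the rows it didn’t see. (I hope I am not wrong here.)
Have a look at the cell where Jeremy calls split_vals().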
This is from the docs… The out-of-bag (OOB) error is the average error for each z_i calculated using predictions from the trees that do not contain z_i in their respective bootstrap sample.
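The definition above relies on every tree missing some of the rows. A quick simulation (my own sketch, not from the docs) suggests that a bootstrap sample of n rows drawn with replacement leaves a bit more than a third of the rows out of bag, so each tree sees roughly 2/3 of the data:

```python
import random

def oob_fraction(n_rows, seed=0):
    """Fraction of rows that never appear in one bootstrap sample of size n_rows."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
    return 1 - len(in_bag) / n_rows

frac = oob_fraction(100_000)
print(f"out-of-bag fraction: {frac:.3f}")  # should land near 1/e, about 0.368
```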
Sorry, I should not have used the term validation set here. I was not talking about the validation set created in earlier steps. What I meant was that when the OOB samples are used to calculate the score, those samples are just random and not from a future time.
I got it answered further down in the video: the validation score is lower than the OOB score, and Jeremy explains why that is.
So I guess the OOB score is used here with knowledge of this limitation.
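For contrast with the random OOB rows, a time-ordered holdout can be sketched like this. The frame and the `date` and `price` columns are made up, and `split_vals` here is my own small helper written in the spirit of the notebook's, not a copy of it:

```python
import pandas as pd

def split_vals(df, n_train):
    """First n_train rows go to training, the remaining rows to validation."""
    return df.iloc[:n_train].copy(), df.iloc[n_train:].copy()

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2011-03-01", "2010-01-15", "2011-09-30", "2010-06-20", "2012-02-11"]
    ),
    "price": [120, 100, 135, 110, 150],
})

# Sort by time first so the validation rows are all later than the training rows.
df = df.sort_values("date").reset_index(drop=True)
train, valid = split_vals(df, n_train=3)
print(train["date"].max(), valid["date"].min())
```

Every validation row is later than every training row, which is exactly the property the OOB rows don't have.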
It rather depends. The ML class goes at a somewhat gentler pace, but doesn’t show how to build world-class models (the focus is more on process and interpretation, with more in-depth discussion of foundational details). The DL class is more intense, and gets you building state-of-the-art models from lesson 1. Each can be understood on its own, but they support each other.
I have a question about the proc_df function and how it uses the median for missing values. Does this not introduce leakage into our machine learning procedure? What I mean is: We’re filling in missing values based on the median of the entire data set, which means that the validation set has an influence on the input that goes into estimating a model.
Wouldn’t it be more correct to first split the data set into training and validation sets, then estimate the median of every numerical column in the training set, and use those median values to fill the missing values in both the training and validation sets?
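To make the difference concrete, here is a toy sketch of the split-first approach. The column name and the numbers are invented, and the last row is deliberately an outlier so that the two medians differ:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"LotFrontage": [60.0, np.nan, 70.0, 80.0, np.nan, 200.0]})

# Split first: the first four rows are training, the last two are validation.
train, valid = df.iloc[:4].copy(), df.iloc[4:].copy()

full_median = df["LotFrontage"].median()      # also looks at validation rows
train_median = train["LotFrontage"].median()  # training rows only

# Fill both sets with the value learned from the training rows.
train["LotFrontage"] = train["LotFrontage"].fillna(train_median)
valid["LotFrontage"] = valid["LotFrontage"].fillna(train_median)
print(full_median, train_median)
```

Here the full-data median is pulled upward by the validation outlier, while the training-only median is not. How much this matters in practice probably depends on how different the two sets are.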
Thanks for that. But I would like to comment on that. I don’t think proc_df should be used on the test data frame in the way it is used. If there is a value missing in the test set, it should be filled with the median of the values in the training set, because that’s what we’ve based the model on.
As a side note and not meant as a criticism towards your approach: There’s probably some “smarter” way to fill in certain values. For example, we see that “GarageBuilt” is one of the features that has missing values, but instead of just filling with the median, it would probably make more sense to look at the median difference between when a house was built and when the garage was built. In the notebook, the median for garage built is 1980. The funny thing is that a missing GarageBuilt value would still be set to 1980 even if the house itself was built in a later year…
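A rough sketch of that idea, with invented numbers and assuming the house year lives in a column I'm calling `YearBuilt`: fill a missing GarageBuilt with the house's year plus the median gap seen in rows where both are known.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "YearBuilt":   [1950, 1975, 1990, 2005, 2008],
    "GarageBuilt": [1960, 1975, 1992, np.nan, np.nan],
})

# Median gap between house and garage, from rows where both years are known.
gap = (df["GarageBuilt"] - df["YearBuilt"]).median()

# Plain median fill versus a fill relative to the house's own year.
naive = df["GarageBuilt"].fillna(df["GarageBuilt"].median())
relative = df["GarageBuilt"].fillna(df["YearBuilt"] + gap)
print(pd.DataFrame({"YearBuilt": df["YearBuilt"], "naive": naive, "relative": relative}))
```

With the plain median, the 2005 and 2008 houses get a garage dated decades before the house existed; the relative fill keeps the garage year at or after the build year.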
I like your thinking (especially the last few lines), but the reason I used proc_df, at least, was to make sure that the categorical encoding stays intact. Otherwise I might have to do the hard work myself (and might end up breaking things).
And yes, filling with the median isn’t always correct, but it works almost always (and we have other options).
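For the categorical encoding part, the key point is that the test data has to reuse the category list learned from the training data. Here is a minimal pandas-only sketch of that, with a made-up `Zone` column; it is an illustration of the idea, not the fastai implementation:

```python
import pandas as pd

train = pd.DataFrame({"Zone": ["north", "south", "east", "north"]})
test = pd.DataFrame({"Zone": ["south", "west", "north"]})

# Learn the category list from the training data only.
train["Zone"] = train["Zone"].astype("category")
cats = train["Zone"].cat.categories

# Reuse that list on the test data; values never seen in training become missing.
test["Zone"] = pd.Categorical(test["Zone"], categories=cats)

# Shift the codes by one so that missing/unseen maps to 0 instead of -1.
train_codes = train["Zone"].cat.codes + 1
test_codes = test["Zone"].cat.codes + 1
print(train_codes.tolist(), test_codes.tolist())
```

The same string gets the same integer in both frames, and "west", which never appeared in training, ends up as 0.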