Lesson 5 - Extrapolation - How does the column is_valid help?

(KNH) #1

Hello experts,

In Lesson 5, under the topic of extrapolation (video starting at 49:23), the professor states the following:

So in this case, what I wanted to do was to first of all figure out what’s the difference between our validation set and our training set. If I understand the difference between our validation set and our training set, then that tells me what are the predictors which have a strong temporal component and therefore they may be irrelevant by the time I get to the future time period. So I do something really interesting which is I create a random forest where my dependent variable is “is it in the validation set” (is_valid). I’ve gone back and I’ve got my whole data frame with the training and validation all together and I’ve created a new column called is_valid which I’ve set to one and then for all of the stuff in the training set, I set it to zero. So I’ve got a new column which is just is this in the validation set or not and then I’m going to use that as my dependent variable and build a random forest. This is a random forest not to predict price but predict is this in the validation set or not. So if your variable were not time dependent, then it shouldn’t be possible to figure out if something is in the validation set or not.

Now I am really confused about this.
How does having the column is_valid help us understand whether there are any time-dependent predictors?
Thinking about this more deeply: in the training set, is_valid is 0 for every row, so the RF algorithm only ever sees a pattern of 0. It therefore seems that the RF algorithm has no way of predicting is_valid in the validation set, since it never saw a row with is_valid equal to 1 during its training.

Kindly elaborate.

Kiran Hegde


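# df_keep, n_trn and proc_df come from the lesson notebook.
# df_keep holds the training rows first, followed by the validation rows.
df_ext = df_keep.copy()
# Label every row as validation, then reset the first n_trn (training) rows to 0.
df_ext['is_valid'] = 1
df_ext.iloc[:n_trn, df_ext.columns.get_loc('is_valid')] = 0
# Use is_valid, not the sale price, as the dependent variable.
x, y, nas = proc_df(df_ext, 'is_valid')
m = RandomForestClassifier(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(x, y);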

The job of the RF is to figure out whether observation i has an is_valid value of 0 or 1. Note that the model is fit on df_ext, which contains the training and validation rows together, so it sees both 0s and 1s during training. If the validation set were picked randomly, then indeed no feature of row i could help. Here, however, the validation set consists of the rows after the first n_trn, i.e. the most recent ones, so time-varying features carry information about which set a row belongs to. For example, Jeremy shows that SalesID and MachineID are good predictors of what is in the validation set (most likely they are just counters the business increments over time).
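
To make the point above concrete, here is a minimal sketch on synthetic data, not the lesson's bulldozer data set. The column names sale_id and noise and the helper fit_is_valid are made up for illustration. It fits the same kind of classifier twice: once with is_valid set on the last rows (a temporal split) and once with the same labels shuffled (a random split).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, n_trn = 2000, 1500

# Synthetic stand-in for a data frame that is already sorted by date.
df = pd.DataFrame({
    "sale_id": np.arange(n),       # hypothetical counter that grows over time
    "noise": rng.normal(size=n),   # hypothetical feature unrelated to time
})
features = ["sale_id", "noise"]

def fit_is_valid(is_valid):
    """Fit a forest that predicts validation-set membership."""
    m = RandomForestClassifier(n_estimators=40, min_samples_leaf=3,
                               max_features=0.5, n_jobs=-1,
                               oob_score=True, random_state=0)
    m.fit(df[features], is_valid)
    return m

# Temporal split: the rows after the first n_trn form the validation set.
temporal = np.zeros(n, dtype=int)
temporal[n_trn:] = 1
m_time = fit_is_valid(temporal)

# Random split of the same size, for comparison.
random_split = rng.permutation(temporal)
m_rand = fit_is_valid(random_split)

print("temporal split OOB accuracy:", m_time.oob_score_)
print("random split OOB accuracy:  ", m_rand.oob_score_)
print("temporal split importances: ",
      dict(zip(features, m_time.feature_importances_)))
```

With the temporal split, the forest should separate the two sets almost perfectly and put most of the importance on the counter-like column. With the shuffled labels, neither column carries any signal about membership, so the out-of-bag accuracy should be clearly lower. A high score on the real data is the hint that some predictors are time dependent.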