Set_rf_samples and OOB score - Why are they not compatible

Hello there,

The professor states in Lesson 2 that set_rf_samples and the OOB score are not compatible.
I am a bit confused.

I fail to understand why they are not compatible. As I understand it, the OOB score is calculated for a row in the sample by making use of trees where this row was not used during training.
With subsampling, each tree will get a different subset of rows out of the total dataset. If that's the case, then each row in the dataset will have been used in only one tree, and hence all of the other trees can be used to predict the dependent variable for this row.

Shouldn’t OOB score still be valid then?

Kindly clarify

Regards,
Kiran Hegde

Hi Kiran,

My understanding was that the implementation in sklearn was not compatible with the “set_rf_samples” function written in the 0.7 fastai library used in the course. I don’t believe there is anything from a theoretical perspective that would prevent you from using both concepts… just the code implementation.
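To make the incompatibility concrete: my recollection (treat this as a sketch, not the actual fastai 0.7 source) is that set_rf_samples works by swapping out sklearn's internal per-tree sampling helper so each tree draws only n indices instead of the full training-set size. sklearn's oob_score path still assumes full-size bootstrap samples when deciding which rows were "out of bag" for each tree, so the two pieces of code disagree. The stand-in function names below are hypothetical, just to show the shape of the patch:

```python
import numpy as np

# Illustrative stand-ins for sklearn's internal per-tree sampler
# (the real helper lives in sklearn.ensemble internals and has changed
# names across versions, so this is only a sketch of the idea).
def default_sample_indices(random_state, n_samples):
    # Default bootstrap: n_samples draws with replacement.
    rng = np.random.RandomState(random_state)
    return rng.randint(0, n_samples, n_samples)

def subsampled_indices(random_state, n_samples, n):
    # What set_rf_samples(n) effectively swaps in: only n draws per tree.
    rng = np.random.RandomState(random_state)
    return rng.randint(0, n_samples, n)

idx = subsampled_indices(42, 100_000, 20_000)
print(len(idx))  # 20000 rows per tree instead of 100000
```

The OOB machinery, which was written against the default sampler, then miscomputes which rows each tree never saw.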

Oh, and one tiny little correction on your post:
“With subsampling, each tree will get a different subset of rows out of the total dataset. If that's the case, then each row in the dataset will have been used in only one tree and hence all of the other trees can be used to predict the dependent variable for this row.”

This isn’t actually true (most likely), but I get what you’re saying. I would expect each row to be used in multiple trees/estimators… because the sampling is done with replacement. That said, it doesn’t mean you can’t make predictions using only trees that were not trained on those rows, so you’re right on the concept, just off on the details a tiny bit I think.

Thanks!

Daniel

Thanks @dcooper01 for taking the time to respond to my question.
Much appreciated.

Hi,
I also have some doubts about oob and set_rf_samples:

  • with set_rf_samples(n) every tree will have n random rows from the dataframe, not the same set of rows, right?
  • the oob score is calculated over all the rows of the train dataset, right? So if I set an n that is too low in set_rf_samples(n), I can’t be sure that each row of the dataframe is in at least one tree of the random forest. If that happens, do I receive an error, and otherwise is the oob calculated correctly?
  • If I don’t use set_rf_samples(n), how many rows will there be in each tree, considering that in RandomForestRegressor the default value of the parameter “bootstrap” is True? With bootstrap, is each row in at least one tree?

Thank you

Stefania

Hi @Stefania,

  • That’s correct: every tree will get n random rows from the entire training set, and not the same rows. As the docstring of the function says: “Changes Scikit learn’s random forests to give each tree a random sample of n random rows.”
  • The oob_score is not compatible with set_rf_samples function and hence even if you do get some value of oob_score and not the error, you must ignore it.
  • Well, you can’t be sure that each row is in at least one tree. However, by default, for RandomForestRegressor, each tree gets the same number of rows as there are in the training set, and these rows are picked randomly with replacement. This effectively amounts to ~63% of the training set being represented in each tree.
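
Where the ~63% comes from: the probability that a given row appears at least once in a full-size bootstrap sample is 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632 as n grows. A quick sketch (the training-set size here is arbitrary):

```python
import numpy as np

n = 100_000  # training-set size; each tree also draws n rows with replacement

# Probability a given row appears at least once in one tree's sample:
# 1 - ((n - 1) / n) ** n  ->  1 - 1/e  ~  0.632
p_in_tree = 1 - (1 - 1 / n) ** n
print(round(p_in_tree, 3))  # 0.632

# Empirical check: fraction of distinct rows in one bootstrap sample.
rng = np.random.RandomState(0)
sample = rng.randint(0, n, n)
print(round(len(np.unique(sample)) / n, 3))  # close to 0.632
```

The remaining ~37% of rows are out-of-bag for that tree, which is exactly what the standard oob_score relies on.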