Set_rf_samples and OOB score - Why are they not compatible


(KNH) #1

Hello there,

The professor states in Lesson 2 that the set_rf_samples function and the OOB score are not compatible.
I am a bit confused.

I fail to understand why they are not compatible. As I understand it, the OOB score for a row is calculated using only the trees in which that row was not used during training.
With subsampling, each tree will get a different subset of rows out of the total dataset. If that's the case, then each row in the dataset will have been used in only one tree, and hence all of the other trees can be used to predict the dependent variable for this row.

Shouldn’t OOB score still be valid then?

Kindly clarify

Regards,
Kiran Hegde


(Daniel Cooper) #2

Hi Kiran,

My understanding was that the implementation in sklearn was not compatible with the “set_rf_samples” function written in the 0.7 fastai library used in the course. I don’t believe there is anything from a theoretical perspective that would prevent you from using both concepts… just the code implementation.

Oh, and one tiny little correction on your post:
“With subsampling, each tree will get a different subset of rows out of the total dataset. If thats the case, then each row in the dataset will have been only used in one tree and hence all of the other trees can be used to predict the dependent variable for this row.”

This isn’t actually true (most likely), but I get what you’re saying. I would expect each row to be used in multiple trees/estimators… because the sampling is done with replacement. That said, it doesn’t mean you can’t make predictions using only trees that were not trained on those rows, so you’re right on the concept, just off on the details a tiny bit I think.
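To make that concrete, here is a small simulation using only the Python standard library. It is a sketch of the idea, not the fastai or sklearn code: the helper names (draw_tree_samples, in_bag_counts) and the sizes are made up for illustration, and it assumes each tree draws its rows independently with replacement.

```python
import random

def draw_tree_samples(n_rows, n_trees, sample_size, seed=0):
    """Return one set of in-bag row indices per tree (drawn with replacement)."""
    rng = random.Random(seed)
    return [
        {rng.randrange(n_rows) for _ in range(sample_size)}
        for _ in range(n_trees)
    ]

def in_bag_counts(tree_samples, n_rows):
    """Count, for every row, how many trees saw it during training."""
    counts = [0] * n_rows
    for sample in tree_samples:
        for row in sample:
            counts[row] += 1
    return counts

# Hypothetical sizes: 1000 rows, 40 trees, 200 draws per tree.
n_rows, n_trees, sample_size = 1000, 40, 200
samples = draw_tree_samples(n_rows, n_trees, sample_size)
counts = in_bag_counts(samples, n_rows)

# A tree is out-of-bag for a row if the row was not in that tree's sample.
oob_counts = [n_trees - c for c in counts]

print("rows used by more than one tree:", sum(c > 1 for c in counts))
print("rows with at least one OOB tree:", sum(c > 0 for c in oob_counts))
```

With these sizes, most rows end up in several trees, and every row still has plenty of trees that never saw it, which is all an OOB prediction needs.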

Thanks!

Daniel


(KNH) #3

Thanks @dcooper01 for taking the time to respond to my question.
Much appreciated.


#5

Hi,
I also have some doubts about oob and set_rf_samples:

  • with set_rf_samples(n) every tree will have n random rows from the dataframe, not the same set of rows, right?
  • the oob score is calculated over all the rows of the train dataset, right? So if I set an n too low in set_rf_samples(n) I can’t be sure that each row of the dataframe is in at least one tree of the random forest. If this is the case do I receive an error, and otherwise is the oob calculated correctly?
  • If I don’t use set_rf_samples(n), how many rows will there be in each tree, considering that in RandomForestRegressor the default value of the parameter “bootstrap” is True? With bootstrap, is each row in at least one tree?
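One rough way to explore the last two points is a small simulation with the Python standard library. This is only a sketch under assumptions: it assumes that a default bootstrap draws as many rows as the dataset has, with replacement, and that a subsample of size n is also drawn with replacement. The function names and sizes are hypothetical, and none of this is the sklearn or fastai code.

```python
import random

def unique_fraction(n_rows, sample_size, seed=0):
    """Fraction of the dataset's rows that appear in one tree's sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_rows) for _ in range(sample_size)}
    return len(drawn) / n_rows

def prob_row_never_sampled(n_rows, sample_size, n_trees):
    """Chance that one given row is left out of every tree's sample."""
    return (1 - 1 / n_rows) ** (sample_size * n_trees)

n_rows = 10_000  # hypothetical dataset size

# Full bootstrap: draw n_rows rows with replacement for a single tree.
full_bootstrap = unique_fraction(n_rows, n_rows)

# Small subsample: draw only 500 rows with replacement for a single tree.
small_subsample = unique_fraction(n_rows, 500)

print("distinct rows per tree, full bootstrap:", full_bootstrap)
print("distinct rows per tree, 500-row subsample:", small_subsample)
print("P(row in no tree), 10 trees, full bootstrap:",
      prob_row_never_sampled(n_rows, n_rows, 10))
print("P(row in no tree), 10 trees, 500-row subsample:",
      prob_row_never_sampled(n_rows, 500, 10))
```

Under these assumptions a full bootstrap gives each tree roughly 63% of the distinct rows, so there is no guarantee that a given row lands in any particular tree. A row that no tree has seen is simply out-of-bag for every tree.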

Thank you

Stefania