The OOB (out-of-bag) score is used when we don't have a big dataset and splitting off a separate validation set would take away useful data that could go toward training the model. So we basically use the training data as the validation set, relying on the samples that were not used to train particular trees.
Each tree is trained on a bootstrap sample, so for each tree there are some rows it never saw. When evaluating on the training data, for each sample we only consider the predictions of the trees that did not use that sample to train themselves.
Let’s say we have 100 trees and 3k samples in the training set, and we want to evaluate the model. For evaluation we need the model’s prediction for each sample, so we start iterating through the samples.
Normally we would use all 100 trees and average their predictions, but in the OOB case we only use the trees that did not see this sample during training, so the count is very likely less than 100, since some trees have probably used this sample to train themselves.
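The per-sample averaging described above can be sketched directly. This is a minimal illustration, not a real forest: the tree predictions are just random stand-in values, and all sizes (5 trees, 10 samples) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_trees = 10, 5

# Bootstrap: each tree trains on n_samples rows drawn with replacement.
bootstrap_idx = [rng.integers(0, n_samples, n_samples) for _ in range(n_trees)]

# A row is out-of-bag (OOB) for a tree if it never appeared in that
# tree's bootstrap sample.
oob_mask = np.array([~np.isin(np.arange(n_samples), idx)
                     for idx in bootstrap_idx])  # shape (n_trees, n_samples)

# Stand-in predictions: pred[t, i] is tree t's prediction for sample i
# (random numbers here, just to show the averaging step).
pred = rng.normal(size=(n_trees, n_samples))

# OOB prediction for each sample: average only over the trees that held
# that sample out. A row that landed in every bootstrap has no OOB trees.
oob_pred = np.array([
    pred[oob_mask[:, i], i].mean() if oob_mask[:, i].any() else np.nan
    for i in range(n_samples)
])
```

Each bootstrap draw leaves a row out with probability (1 − 1/n)^n ≈ 37%, which is why with few trees some samples may have no OOB prediction at all.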
On average the OOB score shows slightly worse generalization than a validation score would, because we are using fewer trees to get the prediction for each sample. Recall that as the number of trees grows we generally get better predictive power, even if it flattens out eventually; so by using fewer trees than are available, we get slightly less accurate predictions. But since we get to use all of the training data for training, instead of holding some of it out for validation, while still getting a validation-like score, this trade-off is not bad. And the more trees we add, the less serious this underestimation becomes.
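In practice you rarely compute this by hand; for example, scikit-learn's random forest exposes it directly via `oob_score=True`. A minimal sketch, using a synthetic toy dataset (the sizes and `random_state` are arbitrary choices for the example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy classification data; any dataset works the same way.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True asks the forest to score each training row using only
# the trees that did not see that row (requires bootstrap=True, the default).
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

print(rf.oob_score_)  # accuracy estimated from the out-of-bag rows
```

No rows were held out, yet `oob_score_` still behaves like a validation accuracy, which is exactly the trade-off described above.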
At least that’s what seems to be the case to me after a first pass, but correct me if I’m mistaken.