Can anyone explain why oob score is on average less good than the validation score?
I asked Jeremy after class and I thought I got it, but now I’m feeling lost again.
Here is my understanding:
We have a big data set, and we split it into a training set and a validation set.
If we decide to have 100 trees in total, then for the validation score we train 100 randomized decision trees on the training set and aggregate them into an ensemble. We use this model to get our prediction accuracy on the validation set, which is the validation score.
And for the oob score, we also have 100 trees. But they are both trained and evaluated on the training set. In other words, we don’t use the validation set at all.
Jeremy mentioned something about a subset in oob, but I didn’t see where it is.
Any help is appreciated!
Can I reframe your question as: how are the oob_score’s sample and the validation set’s sample created, and how can the way each sample is formed affect the scores?
The oob_score uses, for each tree, the “left-over” rows that the tree did not draw in its bootstrap sample, while the validation set is a sample of data you yourself decided to subset. In this way, the oob sample is a little more random than the validation set. Therefore, the oob sample (on which the oob_score is measured) may be “harder” than the validation set, and the oob_score may on average be “less good” as a consequence.
For example, Jeremy and Terence use only the last 2 weeks of grocery store data as a validation set. The oob sample may have unused data from across all four years of sales data. The oob_score’s sample may be harder because it’s more randomized and has more variance.
If the oob_score never improves but the validation set score is always very good, you need to re-think how to subset the validation set. In the case of Jeremy and Terence, they might decide to take a more random sample of data across all years rather than strictly the last 2 weeks of data.
The main point here is that the oob score is calculated, for each data point, using only the trees in the ensemble that were not trained on that point, so with oob you are not using the full ensemble.
Whereas with a validation set you use your full forest to calculate the score. And in general a full forest is better than a subset of a forest.
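To make the "subset of the forest" point concrete, here is a minimal simulation sketch. It assumes each tree draws a bootstrap sample the same size as the training set; the function name and the sizes (1000 rows, 100 trees) are made up for illustration. It counts, for every row, how many trees never drew that row, which is the number of trees that can vote on that row's oob prediction.

```python
import random

def oob_tree_counts(n_rows, n_trees, seed=0):
    """For each row, count the trees whose bootstrap sample never drew it."""
    rng = random.Random(seed)
    counts = [0] * n_rows
    for _ in range(n_trees):
        # One bootstrap sample: n_rows draws with replacement.
        in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
        for row in range(n_rows):
            if row not in in_bag:
                counts[row] += 1
    return counts

counts = oob_tree_counts(n_rows=1000, n_trees=100)
mean_count = sum(counts) / len(counts)
print("mean trees available per row:", mean_count)
print("min / max trees available:", min(counts), max(counts))
```

Under these assumptions each row should end up with only a bit more than a third of the trees, while a validation row is always scored by all of them.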
Don’t use validation and oob together when evaluating your model. It’s like apples and oranges, so pick one and tune your forest based on the increases or decreases in either oob or validation.
Oob particularly helps when we can’t afford to hold out a validation set. For example, I am working with patient data which has 257 observations. If I were to leave out 20% of this data for validation, this might damage some of the relationships in my data. But given enough data, like groceries, it’s feasible to work with a subset.
I hope this helps. So it’s dependent on your case, and @jeremy wanted us to understand how oob is calculated under the hood. Another good method for very small datasets is LOO cross validation.
I think now I get it. In my oob example, we cannot guarantee that the number of trees used for each data point is exactly 100.
But it could be 100 when no trees include that point of data, right?
For that, the point must not be drawn during bootstrapping for any of the trees we are building. You can calculate the probability of that, but having a full oob sample that was not included in any tree is almost impossible. That’s why in general we say oob tends to be worse than the actual validation score.
This is equivalent to having trees that were all built from the exact same set of points.
n = 10
subsample_size = 10000
Let’s say we build the first tree. Then, as a rough estimate, Pr(another tree draws the exact same bootstrap sample, and so has the same oob set) = (1 / 10000) ** 10000 -> (probability of picking the exact same points in 10000 draws)
this is just for one other tree
for all 9 other trees
((1 / 10000) ** 10000) ** 9
So yeah, I am pretty sure oob will not include some trees
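The question of whether a single point can get all of the trees (or none of them) can also be answered directly. A small sketch, reusing `n` and `subsample_size` from above and assuming every tree draws `subsample_size` rows with replacement from a training set of that same size, independently of the other trees (the two function names are made up for illustration):

```python
import math

def prob_oob_for_all_trees(n_rows, n_trees):
    # (1 - 1/n_rows) ** n_rows is the chance that one tree's bootstrap
    # sample misses a given row; trees are sampled independently.
    return (1 - 1 / n_rows) ** (n_rows * n_trees)

def prob_in_bag_for_all_trees(n_rows, n_trees):
    # Chance that every tree drew the row, so no tree can score it out-of-bag.
    return (1 - (1 - 1 / n_rows) ** n_rows) ** n_trees

n = 10
subsample_size = 10000
p_all = prob_oob_for_all_trees(subsample_size, n)
p_none = prob_in_bag_for_all_trees(subsample_size, n)
print("P(row is oob for all trees):", p_all, "vs e^-n:", math.exp(-n))
print("P(row is oob for no tree): ", p_none)
```

Both extremes are rare under these assumptions, so almost every row is scored by some, but not all, of the trees.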
The OOB score is used when we do not have a big dataset and splitting it into a training and a validation set would take away useful data that could be used to train the model. So we basically decide to use the training data as the validation set, by using those samples that were not used for training particular trees.
For each tree we have some rows that were not used to train that particular tree. So when evaluating predictions on the training data, for each sample we only consider trees that did not use that sample to train themselves.
Let’s say we have 100 trees and 3k samples in the training set. We’re going to evaluate the model. For evaluation we need the model’s prediction for each sample. We start iterating through the samples.
In general we use all 100 trees and average their predictions, but in this OOB case we only use those trees which did not use this sample for training, so it’s very likely this number is less than 100, as some trees have probably used this sample to train themselves.
On average the OOB score looks worse than the validation score because we are using fewer trees to get the prediction for each sample. Recall that as the number of trees grows, in general we get better predictive power, even if it flattens out in the end, so if we are using fewer trees than are available we’re getting slightly less accurate predictions. But since we’re using all of the training data for training instead of holding some out for validation, and still getting a validation-like score at the same time, this trade-off is not bad. And the more trees we add, the less serious this underestimation is.
At least that’s what seems to me to be the case after doing the first pass, but correct me if I’m mistaken.
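The per-sample procedure described above can be sketched in plain Python. This is a toy, not how any library implements it: a 1-nearest-neighbour regressor stands in for a fully grown tree, the data is synthetic, and all names and sizes here are made up for illustration.

```python
import random

def fit_1nn(xs, ys):
    """Stand-in for a fully grown tree: predict the y of the nearest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def bagged_ensemble(xs, ys, n_trees, rng):
    """Fit each model on its own bootstrap sample and remember which rows it saw."""
    n = len(xs)
    ensemble = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        model = fit_1nn([xs[i] for i in idx], [ys[i] for i in idx])
        ensemble.append((model, set(idx)))
    return ensemble

def oob_predictions(xs, ensemble):
    """For each training row, average only the models that never saw that row."""
    preds, trees_used = [], []
    for i, x in enumerate(xs):
        votes = [model(x) for model, seen in ensemble if i not in seen]
        trees_used.append(len(votes))
        preds.append(sum(votes) / len(votes) if votes else None)
    return preds, trees_used

def full_predictions(xs, ensemble):
    """Average every model in the ensemble, as we do for a validation set."""
    return [sum(model(x) for model, _ in ensemble) / len(ensemble) for x in xs]

def mse(preds, ys):
    scored = [(p, y) for p, y in zip(preds, ys) if p is not None]
    return sum((p - y) ** 2 for p, y in scored) / len(scored)

def make_data(n, rng):
    # Hypothetical noisy target, only here to have something to fit.
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [x ** 0.5 + rng.gauss(0, 0.2) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(200, rng)
valid_x, valid_y = make_data(100, rng)
ensemble = bagged_ensemble(train_x, train_y, n_trees=30, rng=rng)

oob_preds, trees_used = oob_predictions(train_x, ensemble)
oob_mse = mse(oob_preds, train_y)
train_mse = mse(full_predictions(train_x, ensemble), train_y)
valid_mse = mse(full_predictions(valid_x, ensemble), valid_y)

print("avg trees per oob prediction:", sum(trees_used) / len(trees_used))
print("oob MSE:", oob_mse)
print("full-ensemble MSE on training rows (optimistic):", train_mse)
print("full-ensemble MSE on validation rows:", valid_mse)
```

Scoring training rows with the full ensemble is optimistic because most models have memorised those rows, which is why oob restricts each row to the models that never saw it. How the oob and validation numbers compare in a given run depends on the data and the seed.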
You are not mistaken, @mayeesha. To add to your explanation: suppose our data set has $N$ examples (rows). Each tree in our random forest is trained on a bootstrap sample, which means a set of $N$ samples randomly chosen (with replacement) from the data set. “With replacement” means that each random sample is chosen from the full data set (i.e. before choosing the next sample, we put back the sample we just chose). Now the probability that a particular sample will not be chosen in a single random draw from the full data set is $\frac{N-1}{N}$. So the probability that a sample will not be chosen at all for a tree’s bootstrap sample, which takes $N$ independent draws, is $\left(\frac{N-1}{N}\right)^{N} = \left(1-\frac{1}{N}\right)^{N}$. In the limit of large $N$, this expression approaches $e^{-1} \approx 0.368$. So for each tree, about 36.8% of the samples in the full data set are, on average, “out-of-bag”, i.e. left out, and therefore can be used to evaluate that tree’s predictions.
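A quick numerical check of that limit, as a sketch: the function name and the chosen values of N are arbitrary, and the last part draws one simulated bootstrap sample to compare against the formula.

```python
import math
import random

def prob_out_of_bag(n_rows):
    """Probability that a given row is never drawn in n_rows draws with replacement."""
    return (1 - 1 / n_rows) ** n_rows

for n_rows in (10, 100, 1000, 100000):
    print(n_rows, prob_out_of_bag(n_rows))
print("e^-1 =", math.exp(-1))

# Empirical check: draw one bootstrap sample and measure the left-out fraction.
rng = random.Random(0)
n_rows = 10000
in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
oob_fraction = 1 - len(in_bag) / n_rows
print("simulated out-of-bag fraction:", oob_fraction)
```

The formula value sits slightly below $e^{-1}$ for small $N$ and moves toward it as $N$ grows; the simulated fraction for one tree should land close to the same number.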