OOB score and R2 score - When to use each


I am really confused about which performance metrics we should use to measure the performance of a Random Forest model.

I thought the R2 score was good enough. However, later in lesson 2, a new score called the OOB score is introduced.

Isn't the R2 score good enough?
Why do we need the OOB score in addition to the R2 score?
Are there reasons why one would be used over the other?

Kiran Hegde

Hi Kiran,

R2 is good enough if you have separate train and test data.

The OOB score can be used if you don't have enough data to split into train and validation sets.

For time series I would use R2 on the unseen data from the end of the series (so we simulate predicting the future). The OOB score will be inflated for time series, because it evaluates predictions at random points in time, not in the future.
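A minimal sketch of that idea with scikit-learn (the data and variable names here are made up for illustration): split by time order so the validation set is the "future", and compare the validation R2 with the OOB R2, which is computed on randomly held-out rows instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in for a time-ordered dataset
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))
y = X[:, 0] + 0.1 * rng.normal(size=n)

# Time-ordered split: the last 20% of rows play the role of the "future"
split = int(n * 0.8)
X_train, X_valid = X[:split], X[split:]
y_train, y_valid = y[:split], y[split:]

model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
model.fit(X_train, y_train)

valid_r2 = r2_score(y_valid, model.predict(X_valid))  # simulates predicting the future
oob_r2 = model.oob_score_                             # uses random rows, not the future
print(f"validation R2 = {valid_r2:.3f}, OOB R2 = {oob_r2:.3f}")
```

For a real time series with trend or seasonality, the OOB score would typically come out higher than this honest time-ordered validation score.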

Let me know if I missed something.



Hi @kiranh,

The R2 score tells you how successfully your model accounts for the intrinsic variation in the data. What you call the R2 score comes from running the trained model on the Validation data. The OOB score is technically also an R2 score, because it uses the same mathematical formula; the Random Forest calculates it internally using only the Training data. Both scores predict the generalizability of your model – i.e. its expected performance on new, unseen data.
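To make the "technically also an R2 score" point concrete, here is a sketch using scikit-learn (the lesson uses fastai wrappers around the same estimator; this is an illustrative assumption, not the lesson's exact code). The forest stores, for each training row, the average prediction of the trees that did *not* see that row; applying the ordinary R2 formula to those predictions reproduces `oob_score_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, 0.5, -0.5, 0.0]) + 0.1 * rng.normal(size=500)

model = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
model.fit(X, y)

# oob_prediction_[i] averages only the trees whose bootstrap sample excluded row i
manual_oob_r2 = r2_score(y, model.oob_prediction_)
print(f"oob_score_ = {model.oob_score_:.4f}, manual R2 = {manual_oob_r2:.4f}")
```

So "OOB score" and "R2 score" differ in *which predictions* are scored (out-of-bag training rows vs. a validation set), not in the formula.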

Hello @maciejkpl, thanks for responding. Aren't we always supposed to have both training and testing data? Are there cases where we will have only training data but no testing data?

> OOBscore can be used if you don’t have enough data to split for train and validation set.
In the fastai GitHub notebook, I see the following mentioned about the OOB score:
> Is our validation set worse than our training set because we’re over-fitting, or because the validation set is for a different time period, or a bit of both? With the existing information we’ve shown, we can’t tell. However, random forests have a very clever trick called out-of-bag (OOB) error which can handle this (and more!)

What confuses me is the above statement. How can the OOB score possibly help us answer the issues mentioned above?

Kiran Hegde

Hello @jcatanza, thanks. So is the R2 score only calculated for the Validation data? However, in lesson 2, I see that the print_score function calculates R2 even for the training data. Please clarify this.

Kiran Hegde

Yes, @kiranh, the R2 score can also be calculated from the training data, but that will always be higher and does not predict generalizability. You want to pay attention to the Validation and OOB R2 scores (in that order) to see how well your model can predict on new data.
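A sketch of a `print_score`-style helper in the spirit of the lesson (the actual fastai function differs; the data and names here are illustrative assumptions). It reports R2 on both sets so you can see that the training R2 comes out higher and is therefore the less informative of the two.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 5))
y = X[:, 0] * X[:, 1] + 0.3 * rng.normal(size=600)  # interaction target with noise
X_train, X_valid = X[:480], X[480:]
y_train, y_valid = y[:480], y[480:]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

def print_score(m):
    """Report R2 on both sets; only the validation score estimates generalization."""
    train_r2 = r2_score(y_train, m.predict(X_train))
    valid_r2 = r2_score(y_valid, m.predict(X_valid))
    print(f"train R2 = {train_r2:.3f}, valid R2 = {valid_r2:.3f}")
    return train_r2, valid_r2

train_r2, valid_r2 = print_score(model)  # train R2 will be the higher of the two
```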



So the OOB score works only with bootstrapping.
Bootstrapping means that each tree is not trained on the full dataset.
Instead, each tree draws a random sample with replacement, which covers roughly 63% of the distinct rows. That means there will be unused rows for every tree.
These unused rows can be used as validation for that tree, and so we can see how good our model is at predicting unknown data.
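As a side note, the expected fraction of distinct rows covered by a bootstrap sample of size n is 1 − (1 − 1/n)^n, which tends to 1 − 1/e ≈ 63.2%; a quick simulation confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sample = rng.integers(0, n, size=n)      # bootstrap: draw n row indices with replacement
frac_used = np.unique(sample).size / n   # fraction of distinct rows one tree sees
print(frac_used, 1 - 1 / np.e)           # both close to 0.632
```

The remaining ~37% of rows are the "out-of-bag" rows for that tree.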

So if we have both the R2 from the validation set and the OOB score, they should not differ by a large amount. They both measure the performance of our model on new data.

I think of the OOB score as a "good enough" validation set. Not perfect, but good enough.
The biggest plus is that we can't accidentally construct a "bad" validation set with this method.

To be specific: let's say we trained a model and have a good training R2 and a good OOB score, but the validation R2 is much lower. We can then expect that the validation set differs from the training data in some crucial way. For example, with a time series we could have training data from summer while trying to predict autumn sales of sunscreen.

Now let's say we have a great R2 on the training data but a much lower OOB score and validation R2. Based on that we can suspect that our model is overfitting to the training data. It may then be better to build a simpler, more general model.



Thanks @maciejkpl, much appreciated.

Thank you very much for your time @jcatanza

My pleasure, @kiranh!