OOB - Then and Now (2017-11 vs 2018-10)

When I create a random forest (RF) with oob_score=True, the OOB score is higher than the validation score (R^2). In the lesson 2 video, however, the OOB score was lower. Jeremy explained that it could be higher or lower, but that it would generally be lower.

Does anyone have thoughts on what has changed since Nov 2017 that would cause the OOB score on the bulldozer data set to move from lower than the validation score to higher?

I downloaded the data set directly from Kaggle.
I ran the code on both Colab and Paperspace.

I remember one instance in the lesson notebook (bulldozer RF) where the OOB score is higher than the validation R-squared. Jeremy clarified that this is possible because the OOB score is computed on the training rows that a given tree did not see. So with date-sorted data like this Kaggle set, the OOB rows are likely to come from the same time period as the rows the trees were trained on, which makes them easier to predict and pushes the OOB score higher.
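To make the "unused rows" idea concrete, here is a minimal sketch, assuming each tree draws as many rows as the training set has, with replacement (the standard bootstrap). The function name oob_fraction is made up for illustration. It only counts how many rows one bootstrap sample leaves out, which should come out close to 1/e, or roughly 37%.

```python
import random


def oob_fraction(n_rows=10_000, seed=0):
    """Fraction of rows left out of one bootstrap sample (hypothetical helper)."""
    rng = random.Random(seed)
    # Draw n_rows row indices with replacement; the set keeps the distinct ones.
    in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
    # Rows never drawn are "out of bag" for this tree.
    return 1.0 - len(in_bag) / n_rows


if __name__ == "__main__":
    print(f"out-of-bag fraction: {oob_fraction():.3f}")
```

Those left-out rows still come from the training period, so each tree is scored on rows that sit in between the rows it was trained on.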

Thanks for the response. I don’t remember that particular discussion, but the courses offer quite a bit of material to absorb.

Hi, I would like to continue this discussion and offer a guess. Since the OOB score is calculated on unused rows from the training data, it should in general be a bit better than the score on the validation set, which is harder for the model to predict (because the validation sample comes from the future in this particular case). That is probably why we consistently get a better OOB score when running this notebook.
Sorry to bother you, but maybe @Jeremy or other experienced fellows can explain this case to us and share some intuition on the topic.

You’re not bothering me at all. In fact, I’d like a couple more contributions from more experienced or insightful people.

I agree with @ademyanchuk (Alexey). I expect the OOB score to be at least as good as the validation score, because it is computed from training set examples that – although not used in the model – were drawn from the same distribution as the data used to build the model. Ideally the validation data should be drawn from the same distribution as the training data, in which case the scores should be similar. But in the case where the validation examples are drawn from a different distribution than the training data (for example, when they are later in time) the validation score should be worse than the OOB score.
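The argument above can be checked on a toy problem. What follows is only a sketch under stated assumptions: it uses made-up data with an upward trend over time, and a bag of 1-nearest-neighbour regressors as a stand-in for a random forest (like trees, they cannot extrapolate past the training period). It is not sklearn's RandomForestRegressor, and every name in it (oob_vs_future, nearest_value, r2) is hypothetical.

```python
import bisect
import random


def r2(y_true, y_pred):
    """Coefficient of determination (R^2)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def nearest_value(xs, ys, x):
    """1-nearest-neighbour prediction; xs must be sorted ascending."""
    j = bisect.bisect_left(xs, x)
    if j == 0:
        return ys[0]
    if j == len(xs):
        return ys[-1]
    return ys[j] if xs[j] - x < x - xs[j - 1] else ys[j - 1]


def oob_vs_future(n_train=400, n_valid=100, n_models=50, seed=0):
    rng = random.Random(seed)
    n = n_train + n_valid
    # Assumed toy data: one feature (time) and a target that drifts upward.
    x = [i / n_train for i in range(n)]
    y = [3.0 * t + rng.gauss(0.0, 0.1) for t in x]
    # Time-based split: the validation rows are the most recent ones.
    x_tr, y_tr = x[:n_train], y[:n_train]
    x_va, y_va = x[n_train:], y[n_train:]

    oob_sum = [0.0] * n_train
    oob_cnt = [0] * n_train
    val_sum = [0.0] * n_valid
    for _ in range(n_models):
        # Bootstrap sample of training rows (sorted indices keep x sorted).
        picked = sorted(rng.randrange(n_train) for _ in range(n_train))
        in_bag = set(picked)
        bx = [x_tr[i] for i in picked]
        by = [y_tr[i] for i in picked]
        # OOB: each model predicts only the training rows it did not see.
        for i in range(n_train):
            if i not in in_bag:
                oob_sum[i] += nearest_value(bx, by, x_tr[i])
                oob_cnt[i] += 1
        # Validation: every model predicts every future row.
        for k in range(n_valid):
            val_sum[k] += nearest_value(bx, by, x_va[k])

    scored = [i for i in range(n_train) if oob_cnt[i] > 0]
    oob_r2 = r2([y_tr[i] for i in scored],
                [oob_sum[i] / oob_cnt[i] for i in scored])
    val_r2 = r2(y_va, [s / n_models for s in val_sum])
    return oob_r2, val_r2


if __name__ == "__main__":
    oob_r2, val_r2 = oob_vs_future()
    print(f"OOB R^2: {oob_r2:.3f}   future-period R^2: {val_r2:.3f}")
```

In this setup the OOB rows always have close neighbours from the same period, while the future rows can only be matched to the last training rows, so the OOB R^2 lands well above the validation R^2. With a random split instead of a time-based one, the two scores would be expected to sit much closer together.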