I noticed that somewhere in Lesson 2 or 3 you mentioned that if we use set_rf_samples(), we should not use oob_score. Should that rule still be followed, or has that issue been fixed?
Another treat! Early access to Intro To Machine Learning videos
While going through this discussion forum, I came across a few discussions on the bootstrap argument of the RandomForestRegressor() function and also about set_rf_samples().
I also misunderstood it in the beginning, and reading the conversations just got me more confused. So I decided to dig a bit deeper into the fast.ai and sklearn source code and came up with the following conclusions:
n = no_of_rows_in_dataframe
if (bootstrap == False) {
    all `n` rows are used exactly once per tree for training
}
else if (set_rf_samples(k) is used) {
    `k` rows are sampled per tree for training, and some rows may be repeated
}
else {
    `n` rows are sampled per tree for training, and some rows may be repeated
}
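The three cases above can be sketched with NumPy. This is only an illustration of the sampling logic as described in this post, not the fast.ai or sklearn implementation; `rows_for_tree` and its parameters are hypothetical names, and `k` stands in for the value passed to set_rf_samples().

```python
import numpy as np

def rows_for_tree(n, bootstrap=True, k=None, rng=None):
    """Return the row indices one tree would train on (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    if not bootstrap:
        # bootstrap=False: every row is used exactly once
        return np.arange(n)
    # k mimics set_rf_samples(k); without it, n rows are drawn
    size = n if k is None else k
    # sampling with replacement, so indices may repeat
    return rng.integers(0, n, size=size)

rng = np.random.default_rng(0)
no_bootstrap = rows_for_tree(10, bootstrap=False)
subsample = rows_for_tree(10, k=4, rng=rng)
full_bootstrap = rows_for_tree(10, rng=rng)
print(no_bootstrap, subsample, full_bootstrap)
```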
Also, there was some ambiguity around the oob_score calculation. So, after exploring a bit, here’s what I concluded:
/**************************************************************************************
For simplicity, assume the output corresponding to each input is a single number.
So, y.shape = (n, 1)
y = actual outputs
n = no_of_rows_in_data_frame
For cases with an output vector, the oob_score can be calculated by simply taking the average of the oob_score
of each column of the vector.
****************************************************************************************/
total_prediction = zero_matrix of dimension (n x 1) /* used to accumulate total predictions for each row (by different trees in the forest) which will later be averaged */
no_of_predictions = zero_matrix of dimension (n x 1) /* total number of predictions for each row (which also represents total number of trees in which each row is OutOfBag), used for averaging later */
for (tree in forest) {
    out_of_bag_samples = all_rows - set(rows used by `tree` for training)
    total_prediction[out_of_bag_samples] += tree.predict(out_of_bag_samples)
    no_of_predictions[out_of_bag_samples] += 1
}
predictions = total_prediction / no_of_predictions
oob_score = r2_score(y, predictions)
For exact code of oob_score calculation, refer here
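A runnable version of the pseudocode above might look like the sketch below. It is a hand-rolled forest under stated assumptions, not sklearn's internal code: the synthetic data, the 50 trees, and the manual bootstrap with `rng.integers` are all choices made here for illustration. Only `DecisionTreeRegressor` and `r2_score` come from sklearn.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-3, 3, size=(n, 2))            # synthetic inputs (assumption)
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=n)

total_prediction = np.zeros(n)    # summed OOB predictions per row
no_of_predictions = np.zeros(n)   # number of trees for which the row was OOB

for _ in range(50):
    in_bag = rng.integers(0, n, size=n)        # bootstrap sample, with replacement
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[in_bag] = False                   # rows this tree never saw
    if not oob_mask.any():
        continue
    tree = DecisionTreeRegressor(random_state=0).fit(X[in_bag], y[in_bag])
    total_prediction[oob_mask] += tree.predict(X[oob_mask])
    no_of_predictions[oob_mask] += 1

# only rows that were out-of-bag at least once can be scored
covered = no_of_predictions > 0
predictions = total_prediction[covered] / no_of_predictions[covered]
oob_score = r2_score(y[covered], predictions)
print(f"OOB R^2: {oob_score:.3f}")
```

The `covered` mask matters with few trees or with small samples: a row that was in-bag for every tree has no OOB prediction to average.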
I’ve just published the Lecture 1 Notes (With Jeremy’s permission).
Hope these are helpful and please do point out any points that could be corrected/improved.
I believe I’ll be able to share all the Notes before the End of this month.
Sanyam.
Hello asutosh97
I have a question about the OOB score:
If I understood what Jeremy said correctly (English is not my native language), the OOB score lets you see how well the model works without needing a validation set. For this reason it is also useful when we have little data.
My question is why, in the notebooks, Jeremy uses:
m = RandomForestRegressor(n_estimators=40, n_jobs=1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
If I use oob_score, shouldn’t I call m.fit with the complete data and not only with X_train and y_train?
Lesson 2 introduces the max_features parameter of random forests. To me it looks similar to dropout in neural networks from deep learning. Is this the correct intuition?
Hello @fumpen,
as Jeremy says in one of his lectures, we can’t use any of the test data for calibration. Think of it as data you don’t have until you’ve trained your model completely. Otherwise, you can’t get a true measure of how well the model works.
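To see how the two estimates relate, one option is to fit on the training split only and print both the OOB score and the held-out validation score. The sketch below uses synthetic data from `make_regression` and a random split, both of which are assumptions made here; the course notebooks use their own data and split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# synthetic regression data (assumption, stands in for the real dataset)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True, random_state=0)
m.fit(X_train, y_train)

oob_r2 = m.oob_score_                 # R^2 on rows each tree did not train on
valid_r2 = m.score(X_valid, y_valid)  # R^2 on the held-out split
print(f"OOB R^2: {oob_r2:.3f}, validation R^2: {valid_r2:.3f}")
```

OOB rows are a random subset of the training rows, so the two numbers can differ when the validation set is not a random sample, for example when it covers a later time period.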
Hello Everyone,
In Lecture 2, @jeremy explains how a decision tree is formed by selecting a variable and a splitpoint, at each step, which yields the lowest MSE (as per the naive model). Can someone please explain why exactly is this the splitting methodology? From another source, decision tree splitting is done using the ‘Information Gain’. How are these two (MSE and Information Gain) connected?
hello @vahuja4, think of it this way:
Information Gain = MSE at the root node - avg. MSE of the children after splitting (weighted by the number of rows in each child)
So, IG is largest when the average MSE drops the most. Both are basically indicating the same thing.
I see. But why is this termed to be ‘Information Gain’? Also, would you know why is this the chosen methodology for splitting?

I think it is called that by convention: the closer your predictions come to the actual values, the more information you seem to have gained. MSE basically measures the gap between the actual values and the model’s predictions. So the smaller that gap becomes (i.e. the more the MSE drops), the more information can be thought of as gained.

As you know, in DecisionTreeRegressor the prediction at a node is given by taking the average of the target values of all the data points belonging to it. So, our ultimate goal is to make this average as close as possible to the actual values.
So, we basically do a brute-force search over all possible splits, check which one gives averages closest to the actual values, and use MSE as the metric to measure the closeness.
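The brute-force search described above can be sketched for a single feature as follows. This is a minimal illustration, not sklearn's actual splitter: `best_split` is a hypothetical helper, and using midpoints between consecutive distinct values as candidate thresholds is an assumption made here.

```python
import numpy as np

def best_split(x, y):
    """Try every candidate threshold on one feature and return the one that
    minimises the size-weighted MSE of the two child nodes."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    best_threshold, best_score = None, np.inf
    values = np.unique(x)
    # candidate thresholds: midpoints between consecutive distinct values
    for threshold in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= threshold], y[x > threshold]
        # each child predicts its own mean, so its MSE equals its variance
        score = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
threshold, child_mse = best_split(x, y)
parent_mse = np.var(y)               # MSE of predicting the overall mean
reduction = parent_mse - child_mse   # the "gain" discussed in this thread
print(threshold, child_mse, reduction)
```

A full tree would repeat this search over every feature and then recurse into each child.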
I hope these answer your questions.
Hey guys, if you want an even deeper understanding of your tree-based models/xgb/sklearn etc., check these cool repos out,
What are your thoughts on this Jeremy?
(It’s really nice to interpret the Black Boxes properly…)
Both look Promising
Probably the notebooks are there…
The plots look amazing…
https://nbviewer.jupyter.org/github/slundberg/shap/tree/master/notebooks/
I’m trying out the techniques learnt in lessons 1 and 2 on the House Prices Kaggle competition.
The training set has 1460 rows. Should I still split it in two to get a separate validation set, or should I just rely on oob_score?
I submitted my predictions to Kaggle.
On my validation set, I had a root mean squared error of 0.0486755, but on Kaggle my error was 0.14651, placing me around 2407 on the leaderboard.
Model is at
I would be glad if you could have a look at it and help me improve my score.
I just wish that you would start an international fellowship program for this course too, so that international students can make the most of it and of part 2 of the ML course. The first seven videos of ML1 are a gold mine for tree-based models.