Another treat! Early access to Intro To Machine Learning videos

asutosh97 · June 9, 2018, 12:49pm

While going through this discussion forum, I came across a few discussions on the bootstrap argument of RandomForestRegressor() function and also about the set_rf_samples().

I also misunderstood it in the beginning and reading the conversations just got me more confused. So, I decided to dig a bit deep into fast.ai and sklearn source codes and came up with the following conclusions -->

n = no_of_rows_in_dataframe
if (bootstrap == false) {
  then all `n` rows are considered exactly once per tree for training
}
else if (set_rf_samples(k) is used) {
  then `k` rows are selected per tree for training & there might be some repetitions of rows
}
else {
  then `n` rows are selected per tree for training & there might be some repetitions of rows
}

Also, there were some ambiguity around the oob_score calculation. So, after exploring a bit, here’s what I concluded —>

/**************************************************************************************	
	for simplicity assuming output corresponding to each input is a single number.
	So, y.shape = (n, 1)
	y = actual outputs
	n = no_of_rows_in_data_frame
	
	For cases with a output vector, the oob_score can be calculated by simply taking average of oob_score 
    of each column of the vector.
****************************************************************************************/

total_prediction = zero_matrix of dimension (n x 1) /* used to accumulate total predictions for each row (by different trees in the forest) which will later be averaged */
  
no_of_predictions = zero_matrix of dimension (n x 1) /* total number of predictions for each row (which also represents total number of trees in which each row is Out-Of-Bag), used for averaging later */
  
for (tree in forest) {
	out_of_bag_samples = all_rows - set(rows used by `tree` for training)
    total_prediction += tree.predict(out_of_bag_samples)
    no_of_predictions = (increased by 1 for each row which was in out_of_bag_sample)
}

predictions = total_prediction / no_of_predictions
oob_score = r2_score(y, predictions)

For exact code of oob_score calculation, refer here