Need for set_rf_samples

(sai kiran) #1

I couldn’t understand the need for doing set_rf_samples.

Why do our metrics improve when we use it? Doesn't RandomForest eventually see the complete data anyway when bootstrap=True (which is the default)?

Please explain.

(nok) #2

My understanding is that sub-sampling helps reduce over-fitting (it reduces variance). By default, RandomForest trains each tree on a bootstrapped sample, which effectively uses roughly two-thirds (63.2%, to be precise) of the distinct rows as training data for that tree.
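To see where the 63.2% comes from, here is a quick sketch. It is my own illustration, not fastai code: `unique_fraction` is a hypothetical helper, and the 20,000-row case only assumes that a smaller per-tree sample is drawn with replacement, which may not match exactly what set_rf_samples does internally.

```python
import numpy as np

def unique_fraction(n_rows, n_draws, seed=0):
    """Fraction of distinct rows hit when drawing n_draws row indices with replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, n_rows, size=n_draws)
    return np.unique(idx).size / n_rows

n_rows = 100_000

# Default bootstrap: draw as many rows as the dataset has.
full_bootstrap = unique_fraction(n_rows, n_rows)

# Much smaller per-tree sample (assumed size, for illustration only).
small_sample = unique_fraction(n_rows, 20_000)

print(f"full bootstrap sees {full_bootstrap:.1%} of the rows")
print(f"20k-row sample sees {small_sample:.1%} of the rows")
```

The first number should land close to 1 - 1/e. With the smaller sample, each tree sees far fewer rows, so any two trees overlap much less.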

The less correlated these trees are (more precisely, the less correlated their errors are), the better the result you get, because averaging them smooths out the decision boundary.

Random forests take a bunch of decision trees (low bias, high variance) and average them (lower variance!). They also use an extra trick, selecting a random subset of features at each split, to reduce the correlation between the trees (even lower variance!!!)
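The variance argument can be made precise with a standard identity. As a simplifying assumption, suppose the predictions $T_b(x)$ of the $B$ trees each have variance $\sigma^2$ and every pair has correlation $\rho$ (these symbols are introduced here for illustration, they are not from the posts above):

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more trees only shrinks the second term. The first term stays at $\rho\,\sigma^2$ no matter how many trees you average, so anything that lowers $\rho$, such as random feature subsets or smaller per-tree samples, lowers the floor on the ensemble's variance.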