Need for set_rf_samples


(sai kiran) #1

I don’t understand why we need set_rf_samples.

Why do our metrics improve when we use it? Doesn’t RandomForest eventually see the complete data anyway, since bootstrap=True is the default?

Please explain.


(nok) #2

My understanding is that sub-sampling helps reduce over-fitting (it reduces variance). By default RandomForest trains each tree on a bootstrapped sample, which effectively contains roughly two thirds of the distinct rows (about 63.2%, to be precise). (https://stats.stackexchange.com/questions/173520/random-forests-out-of-bag-sample-size)
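
To check where the 63.2% comes from, here is a small sketch using only the standard library. It draws n row indices with replacement, which is what a bootstrap sample does, and counts how many distinct rows show up. The function name unique_fraction and the row count are my own choices for illustration, not anything from the library.

```python
import math
import random


def unique_fraction(n_rows, seed=0):
    """Fraction of distinct rows in one bootstrap sample of size n_rows."""
    rng = random.Random(seed)
    # Draw n_rows indices with replacement and keep only the distinct ones.
    drawn = {rng.randrange(n_rows) for _ in range(n_rows)}
    return len(drawn) / n_rows


n_rows = 100_000  # hypothetical dataset size
simulated = unique_fraction(n_rows)
theoretical = 1 - math.exp(-1)  # limit of 1 - (1 - 1/n)^n as n grows

print(f"simulated fraction of distinct rows:   {simulated:.3f}")
print(f"theoretical fraction of distinct rows: {theoretical:.3f}")
```

So each tree still draws n rows, but many of them are repeats, and about 36.8% of the original rows are left out of any single tree.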

The less correlated the trees are (more precisely, the less correlated their errors), the better the result, because averaging them smooths out the decision boundary.
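
A rough way to see the effect of correlation is the simulation below. It is not a real forest: I am assuming each "tree" has an error with variance 1 and that any two trees' errors have correlation rho, built from one shared noise term plus one independent term per tree. Under that assumption the variance of the averaged error should be close to rho + (1 - rho) / n_trees, so once you have many trees, only lowering rho helps. The names variance_of_average, n_trees and rho are made up for this sketch.

```python
import random
import statistics


def variance_of_average(n_trees, rho, n_trials=5_000, seed=0):
    """Simulated variance of the mean of n_trees errors with pairwise correlation rho."""
    rng = random.Random(seed)
    shared_weight = rho ** 0.5       # part of the error common to all trees
    own_weight = (1 - rho) ** 0.5    # part of the error unique to each tree
    averages = []
    for _ in range(n_trials):
        shared = rng.gauss(0, 1)
        errors = [
            shared_weight * shared + own_weight * rng.gauss(0, 1)
            for _ in range(n_trees)
        ]
        averages.append(sum(errors) / n_trees)
    return statistics.pvariance(averages)


n_trees = 40
results = {}
for rho in (0.9, 0.5, 0.1):
    results[rho] = variance_of_average(n_trees, rho)
    expected = rho + (1 - rho) / n_trees
    print(f"rho={rho}: simulated={results[rho]:.3f}, formula={expected:.3f}")
```

Giving each tree a different, smaller subset of rows is one way to push the correlation down, at the cost of each individual tree being a bit weaker.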

Quote:
Random forests take a bunch of decision trees (low bias, high variance), average them (lower variance!). And they use an extra trick (select a random subset of features when splitting) to reduce the correlation between the trees (extra lower variance!!!)

https://jvns.ca/blog/2016/01/02/winning-the-bias-variance-tradeoff/