Need for set_rf_samples

sakiran · June 11, 2018, 7:14pm

I couldn’t understand the need for calling set_rf_samples.

Why do our metrics improve when we use it? Doesn’t RandomForest eventually see the complete data anyway, since bootstrap=True is the default?

Please explain.

nok · June 18, 2018, 5:11pm

My understanding is that sub-sampling helps reduce over-fitting (it reduces variance). By default, RandomForest trains each tree on a bootstrapped sample, which effectively uses roughly two thirds (63.2%, to be precise) of the distinct rows as training data for each tree. (https://stats.stackexchange.com/questions/173520/random-forests-out-of-bag-sample-size)
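
The 63.2% figure can be checked with a small simulation. This is only a sketch using the Python standard library; the function name bootstrap_unique_fraction and the row count are made up for illustration and are not part of scikit-learn or fastai.

```python
import math
import random


def bootstrap_unique_fraction(n_rows, seed=0):
    """Fraction of distinct rows in one bootstrap sample of size n_rows."""
    rng = random.Random(seed)
    # Draw n_rows indices with replacement, as a bootstrap sample does,
    # and keep only the distinct ones.
    drawn = {rng.randrange(n_rows) for _ in range(n_rows)}
    return len(drawn) / n_rows


n_rows = 100_000  # hypothetical dataset size
observed = bootstrap_unique_fraction(n_rows)
# Probability that a given row is picked at least once in n_rows draws.
expected = 1 - (1 - 1 / n_rows) ** n_rows
limit = 1 - math.exp(-1)  # the large-n limit, about 0.632

print(f"observed fraction of distinct rows: {observed:.4f}")
print(f"expected fraction:                  {expected:.4f}")
print(f"1 - 1/e:                            {limit:.4f}")
```

The remaining rows (about 36.8%) are the out-of-bag rows for that tree.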

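Giving each tree a smaller sample means the trees overlap less, so they are less correlated. The less correlated the trees are (strictly speaking, the less correlated their errors are), the better the result you get, because averaging them smooths out the decision boundary.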

Quote:
Random forests take a bunch of decision trees (low bias, high variance), average them (lower variance!). And they use an extra trick (select a random subset of features when splitting) to reduce the correlation between the trees (extra lower variance!!!)

https://jvns.ca/blog/2016/01/02/winning-the-bias-variance-tradeoff/
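
To see why lower correlation between trees helps, here is a rough sketch. It assumes every tree's error has the same variance sigma2 and the same pairwise correlation rho, in which case the variance of the averaged prediction is rho * sigma2 + (1 - rho) * sigma2 / n_trees. The function names and the simulated "tree errors" are hypothetical and only illustrate the formula; no real forest is trained here.

```python
import random
import statistics


def variance_of_average(n_trees, sigma2, rho):
    """Variance of the mean of n_trees errors, each with variance sigma2
    and pairwise correlation rho (assumed equal for every pair)."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees


def simulated_variance(n_trees, rho, n_trials=10_000, seed=0):
    """Monte Carlo check with unit-variance errors.

    Each error is a shared noise term plus independent per-tree noise,
    weighted so that any two errors have correlation rho.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_trials):
        shared = rng.gauss(0, 1)
        errors = [
            rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1)
            for _ in range(n_trees)
        ]
        means.append(sum(errors) / n_trees)
    return statistics.pvariance(means)


n_trees = 20
for rho in (0.9, 0.5, 0.1):
    formula = variance_of_average(n_trees, 1.0, rho)
    simulated = simulated_variance(n_trees, rho)
    print(f"rho={rho}: formula={formula:.3f}, simulated={simulated:.3f}")
```

With highly correlated trees the rho * sigma2 term dominates, and adding more trees barely helps. Lowering rho (smaller samples per tree, random feature subsets) is what lets the averaging pay off.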