Understanding subsampling

I am having some difficulty understanding when to subsample. My understanding was that this technique is a way of speeding up experimentation with models before fitting a well-thought-out final one. The payoff for the speed is that no single tree uses all the data, and therefore scores are worse.

My question is: can/should this be applied to fitting my final model? I am running into memory issues (which subsampling would help with), but from what I’ve seen in the lecture and elsewhere, the set_rf_samples() setting was also reset before fitting the final model.

Also, would it make sense to subsample and then use a large number of estimators, to try to get the model to see all (or at least most) of the data? A sketch of what I mean is below.
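
To make that concrete, here is a minimal sketch of the setup I have in mind, assuming plain scikit-learn (as far as I can tell, its max_samples parameter, available since 0.22, plays the same role as fastai's set_rf_samples(); the synthetic data is just a stand-in):

```python
# A minimal sketch, assuming scikit-learn >= 0.22: max_samples is the
# plain-sklearn analogue of fastai's set_rf_samples().
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=10_000, n_features=20, random_state=0)

# Each tree sees a random 10% of the rows; with enough trees the
# ensemble as a whole still covers most of the data.
model = RandomForestRegressor(
    n_estimators=400,  # more trees to compensate for the smaller per-tree sample
    max_samples=0.1,   # fraction of rows drawn (with replacement) for each tree
    n_jobs=-1,
    random_state=0,
)
model.fit(X, y)
```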


I think your understanding of subsampling is correct. If memory limits the size of your final model, it makes good sense to use subsampling with a larger number of trees to improve accuracy. This trades memory for computation time, but may let you achieve accuracy comparable to what you would get with enough memory to store a complete bootstrap sample.

I suggest you try out your idea of subsampling with progressively larger numbers of trees and see whether your accuracy improves. I’d love to hear what you find out!
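
For anyone who wants to run that experiment, a rough sketch might look like the following (scikit-learn assumed, with synthetic data standing in for a real train/validation split):

```python
# Sweep the number of trees while holding the per-tree subsample fixed,
# and watch how the validation score changes.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=20_000, n_features=20, noise=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

for n_trees in [40, 80, 160, 320]:
    model = RandomForestRegressor(
        n_estimators=n_trees,
        max_samples=0.1,  # hold the per-tree subsample fixed
        n_jobs=-1,
        random_state=0,
    )
    model.fit(X_train, y_train)
    print(n_trees, model.score(X_valid, y_valid))  # R^2 on the validation set
```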

I share the same confusion as the OP. Isn’t this essentially the same as bagging, just with smaller samples? Bagging also uses random subsets, right?

I’ve experimented with it; the results are comparable to the model trained on the whole dataset, but with less training time. But does it actually reduce overfitting, or is it just for faster training?
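
One way to probe the overfitting question yourself is to compare the train score against the out-of-bag score for a full bootstrap versus a small subsample. A sketch, again assuming scikit-learn with synthetic data as a stand-in:

```python
# Compare train score vs. out-of-bag score for full bootstrap vs. 10% subsample.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=20_000, n_features=20, noise=10, random_state=0)

for max_samples in [None, 0.1]:  # None = full bootstrap sample per tree
    model = RandomForestRegressor(
        n_estimators=100,
        max_samples=max_samples,
        oob_score=True,  # score each row using only trees that didn't see it
        n_jobs=-1,
        random_state=0,
    )
    model.fit(X, y)
    # A large gap between train and OOB score suggests overfitting.
    print(max_samples, model.score(X, y), model.oob_score_)
```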