Why does parallel_trees take much longer than model.predict?


I’ve trained an RF with 200 estimators and I’m trying to find the optimal number of trees. I’m doing it with the parallel_trees function that @jeremy explains in Lesson 3, like:

preds = np.stack(parallel_trees(model, get_preds))
This takes about 1 minute (I have 230k rows). However, when I call model.predict(train_x), it takes just 5 seconds.

Why does this happen? Aren’t they doing the same thing internally, i.e. passing the 230k rows through each of the 200 trees, with the only difference being the final averaging of the results? Both cases are parallelized (8 cores), so where’s the difference? Some kind of vectorization, maybe?
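For context, here is a small self-contained sketch (toy data of my own, not the 230k-row set from the question) showing that for a RandomForestRegressor the forest prediction is exactly the mean of the stacked per-tree predictions, so the two code paths do compute the same numbers:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the question's 230k-row dataset
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
m = RandomForestRegressor(n_estimators=20, n_jobs=-1, random_state=0).fit(X, y)

# Predict with each tree individually, then stack: shape (n_trees, n_rows)
per_tree = np.stack([t.predict(X) for t in m.estimators_])

# The forest's prediction is just the mean over trees
assert np.allclose(per_tree.mean(axis=0), m.predict(X))
```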


They are not doing the same thing.

preds = np.stack(parallel_trees(model, get_preds))

actually does two things:

  1. It calls get_preds, which does m.predict(X_train) --> this is what you are comparing against
  2. It gets the prediction from each individual tree and stacks them into an array
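If I remember right, parallel_trees in the fastai 0.7 library (structured.py) is roughly this one-liner; note that it spins up its own pool of worker *processes*, separate from whatever n_jobs the model itself uses:

```python
from concurrent.futures import ProcessPoolExecutor

# Roughly the fastai 0.7 definition: each tree in the forest is handed
# to a worker process, and fn (e.g. get_preds) is called on it there.
def parallel_trees(m, fn, n_jobs=8):
    return list(ProcessPoolExecutor(n_jobs).map(fn, m.estimators_))
```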

Correct me if I’m wrong, but looking at the code:

def get_preds(t): 
    return t.predict(X_valid)

get_preds calls t.predict on each individual tree of the forest. So that essentially passes each row through each tree, in parallel (because n_jobs=-1). Pretty much the same thing as m.predict, isn’t it?

What am I missing? :thinking:

So you mean np.stack, which stacks the numpy arrays of individual tree predictions, should use essentially zero CPU and complete in 0 ms?

Indeed, most of the time is spent in the parallel_trees call itself; stacking the resulting list of arrays together afterwards is quick.
My question is: why does m.predict take so much less time if both are parallelized? The only difference I see is that m.predict averages the matrix of predictions, which the other doesn’t, but that’s just sums and divisions, which are fast too.
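To put a rough number on that last point, here is a sketch with array sizes matching the question (200 trees × 230k rows, random values standing in for real predictions), showing how little work the averaging step is:

```python
import time
import numpy as np

# Stand-in for the stacked per-tree predictions: 200 trees x 230k rows
preds = np.random.rand(200, 230_000)

t0 = time.perf_counter()
avg = preds.mean(axis=0)   # the only extra work m.predict does
elapsed = time.perf_counter() - t0

assert avg.shape == (230_000,)
# Averaging ~46M floats typically takes tens of milliseconds at most,
# so it cannot account for a gap of nearly a minute.
```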