Scikit-learn RandomForest

Hello,

I have read the RandomForest docs, which have this description of random subset selection:

In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

I tried bootstrap=False and max_features=1.0 without subsampling, which should mean using my whole dataset and my whole feature set. I still observe that different trees are constructed: I get different scores from the same train and validation sets. That being said, if bootstrapping, subsampling, and max_features are not specified, how is it still randomizing, and hence giving different results on each run?

Ex code:

from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1, max_features=1.0)
m.fit(X_train, y_train)
print_score(m)  # print_score is the fastai course notebook helper
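
For a self-contained repro (illustrative only: synthetic data stands in for the course's X_train/y_train):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_train = rng.rand(500, 5)                      # synthetic stand-in data
y_train = X_train @ rng.rand(5) + 0.1 * rng.randn(500)

scores = []
for _ in range(3):
    m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False,
                              n_jobs=-1, max_features=1.0)
    m.fit(X_train, y_train)
    scores.append(m.score(X_train, y_train))
print(scores)  # if these differ across iterations, different trees were built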

Shouldn’t the single best tree with max_depth=3 be deterministic here?

If it is doing some other row subsampling even though bootstrap=False, what is its default setting? The docs only state what I’ve pasted above.

Thanks!


Did you set the RF sample size (set_rf_samples(), or whatever the fastai lib function is)? That would add randomness.

No, but just in case I also ran reset_rf_samples()

But with subsample=None, max_features=1.0, and bootstrap=False, we don’t expect to see any randomness, right? I wonder why they didn’t add max_subsample as an argument?

Interesting question. Try max_features=None - maybe None and 1.0 are different somehow.

Excellent question!


Still inconsistent…

reset_rf_samples()  # fastai helper: undoes set_rf_samples(), so sklearn sees the full training set
m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1, max_features=None)
m.fit(X_train, y_train)
print_score(m)

OK you’ve reached the limits of my knowledge of sklearn… Want to check the sklearn source code? Look at the source for reset_rf_samples() to see roughly what to look for.

Maybe this?

To clarify: suppose the algorithm computes the split score for feature A and then for feature B, and both yield the same score N. Whichever feature is evaluated first wins the tie, so if the evaluation order changes between runs, a different feature is chosen at that node. You can see how each decision tree will then be different, and have different scores during test, even if the training set is the same (100% of the time if max_depth=None, of course). (You can confirm this.)
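
A quick way to see this tie-breaking effect (my own sketch, not from the thread: feature 1 is an exact copy of feature 0, so every candidate split ties exactly):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 1)
X = np.hstack([X, X])                           # two identical features -> tied split scores
y = (X[:, 0] > 0.5).astype(float) + 0.01 * rng.randn(200)

for seed in (0, 1, 2):
    m = RandomForestRegressor(n_estimators=1, max_depth=1, bootstrap=False,
                              max_features=None, random_state=seed)
    m.fit(X, y)
    print(seed, m.estimators_[0].tree_.feature[0])  # feature chosen at the root; can change with the seed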

I guess that’s possible - using the debugger can help dig into stuff like this. https://stackoverflow.com/questions/32409629/what-is-the-right-way-to-debug-in-ipython-notebook


So I tried this:

The train score comes out the same up to the last decimal every time. The test score differs only in the last 3 decimals (out of 17). I think the model is consistent. Might the inconsistency in the test score just be floating-point precision fluctuation, from converting numbers to and from binary? If that is the case, why is the train score exactly the same?
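
One way to separate “different tree” from “floating-point noise” (a sketch; X_train/y_train/X_valid assumed to be the course notebook variables):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

preds = []
for _ in range(5):
    m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False,
                              n_jobs=-1, max_features=None)
    m.fit(X_train, y_train)
    preds.append(m.predict(X_valid))
print(np.array_equal(preds[0], preds[1]))  # bit-identical predictions?
print(np.allclose(preds[0], preds[1]))     # or only equal up to float tolerance?

A single fixed tree’s predictions on the same data should be bit-for-bit reproducible, so genuine differences point to different trees rather than float noise.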


Try passing random_state=99 to the RF constructor. It forces the same pseudo-random sequence every time. See if you get the same results then.
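
e.g. (quick sketch, assuming your X_train/y_train):

from sklearn.ensemble import RandomForestRegressor

m1 = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False,
                           max_features=None, random_state=99).fit(X_train, y_train)
m2 = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False,
                           max_features=None, random_state=99).fit(X_train, y_train)
print((m1.predict(X_train) == m2.predict(X_train)).all())  # expect True with a fixed seed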

But if we use random_state=99, then it forces the same sequence and consequently the same score for bootstrap=True too.
I checked both bootstrap=False and True; the results are the same every time.

I was trying to point to Kerem’s post about the inconsistency in results when using bootstrap=False. I think they are ‘consistent’, with just minor variations in the test score (maybe due to numeric-to-binary conversion, or vice versa).