# Random Forests and Collinearity/Correlation

I was listening to a podcast the other day about ensemble techniques, and the host said that random forests can sometimes underperform when the weak learners (the individual trees) are correlated or the features exhibit collinearity. The reasoning is that, in a voting scheme, correlated models skew the outcome in their favor.

I was wondering if, in practice, that is often a concern that needs to be addressed. If so, what are some good approaches to mitigate it? And if not, why not? Using the tools we've learned previously, I was thinking that performing Ridge Regression (or elastic net weighted toward Ridge) would be a decent start.
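As a toy illustration of why Ridge helps with collinearity (on made-up data, not anything from class): with two nearly identical columns, plain least squares can split the weight between them almost arbitrarily, while the L2 penalty pulls the coefficients toward a shared, stable value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly a copy of x1
y = 3 * x1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS can assign the weight to either near-duplicate column; Ridge
# shrinks both coefficients toward roughly half the true effect each.
print("OLS:  ", ols.coef_)
print("Ridge:", ridge.coef_)
```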

Also, if Iâve messed up any terminology, please correct me!

@fryanpan I was reading about parameter tuning in random forests, and the collinearity problem can be somewhat reduced by controlling the `max_features` parameter. This controls how many features are considered at each split when building a tree. If every split can consider all the features, you lose diversity across the individual trees, which somewhat defeats the purpose of a random forest.
Please correct me if I'm wrong.
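A quick sketch of what tuning `max_features` looks like in scikit-learn, on synthetic data with correlated columns (the dataset and the candidate values are illustrative, not the Bulldozer data):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# effective_rank < n_features makes the columns correlated.
X, y = make_regression(n_samples=500, n_features=30, n_informative=10,
                       effective_rank=8, noise=5.0, random_state=0)

# max_features=1.0 lets every split consider every feature (less tree
# diversity); "sqrt" and 0.3 restrict each split to a random subset.
scores = {}
for mf in [1.0, "sqrt", 0.3]:
    rf = RandomForestRegressor(n_estimators=100, max_features=mf,
                               random_state=0, n_jobs=-1)
    scores[mf] = cross_val_score(rf, X, y, cv=3).mean()
    print(f"max_features={mf!r}: mean CV R^2 = {scores[mf]:.3f}")
```

Which value wins depends on the dataset, so it's worth cross-validating rather than assuming a smaller subset is always better.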


Not sure if it's technically sound, but I was curious to see what would happen by running elastic net before random forests on the Bulldozer example from class. I set alpha to 0.65 (~2/3, so leaning toward L1/LASSO) and it dropped the number of columns from ~52 to 21. I upped the number of estimators in the Random Forest and the score met or exceeded (maybe by only a few thousandths) what we did in class. I know it's only a single data point, but I was surprised to see similar results from only 2/5 of the number of columns in this simple example.
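One naming note: in scikit-learn the L1/L2 mixing parameter is called `l1_ratio` (glmnet calls it alpha), while sklearn's `alpha` is the overall penalty strength. A rough sketch of that two-stage idea on synthetic stand-in data (column counts and parameter values are illustrative, not the actual Bulldozer setup):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=1000, n_features=52, n_informative=15,
                       noise=10.0, random_state=0)

# l1_ratio=0.65 leans toward the L1 (lasso) end, which zeroes out
# coefficients on uninformative columns; keep only the nonzero ones.
Xs = StandardScaler().fit_transform(X)
enet = ElasticNet(alpha=1.0, l1_ratio=0.65, random_state=0).fit(Xs, y)
keep = np.flatnonzero(enet.coef_ != 0)
print(f"kept {len(keep)} of {X.shape[1]} columns")

# Then fit the random forest on the surviving columns only.
rf = RandomForestRegressor(n_estimators=200, random_state=0, n_jobs=-1)
rf.fit(X[:, keep], y)
```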

Take a look at the feature importances returned from RF (`feature_importances_` in scikit-learn). You'll see which variables are most predictive and which are irrelevant.
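A minimal example of inspecting those importances, on made-up data where the first three columns carry the signal by construction; note that correlated columns tend to split their importance between them, which ties back to the original question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# With shuffle=False, the informative features are the first columns.
X, y = make_regression(n_samples=500, n_features=10, n_informative=3,
                       shuffle=False, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; the informative columns should dominate.
order = np.argsort(rf.feature_importances_)[::-1]
print("most important columns:", order[:3])
print(rf.feature_importances_.round(3))
```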


We'll be tackling this in the next 2 lessons. But basically: what @shik1470 said