Does anyone know the difference between the feature importance of Random Forest (bagging ensemble of trees) and XGBoost (boosting ensemble of trees)?

Thanks for helping!

The feature importance in both cases is computed the same way: given a tree, go over all of its internal nodes and do the following (from The Elements of Statistical Learning, p. 368, freely available here):

At each such node t, one of the input variables Xv(t) is used to partition the region associated with that node into two subregions; within each a separate constant is fit to the response values. The particular variable chosen is the one that gives maximal estimated improvement in … risk over that for a constant fit over the entire region. The squared relative importance of variable is the sum of such squared improvements over all internal nodes for which it was chosen as the splitting variable.

Thus, both Random Forest and XGBoost generalize this method: for each tree, apply the procedure above, then average the result across all the trees in the ensemble.

That is the most concise theoretical explanation I can give. A more practical one can be found in the docs of the respective libraries. For Random Forest, I recommend reading this post from Jeremy and Terence explaining the perils of this technique and why they prefer another mechanism, permutation importance. There they quote this Stack Overflow post explaining how the above mechanism is implemented in scikit-learn:

It is sometimes called “gini importance” or “mean decrease impurity” and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.
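To make the "averaged over all trees" part concrete, here is a minimal sketch, assuming scikit-learn and NumPy are installed. The synthetic dataset and the variable names (`forest`, `per_tree`, `averaged`) are my own illustration, not something from the posts quoted above. It averages each tree's impurity-based importances by hand and compares the result with what the forest reports.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem (illustrative only): 5 features, 2 informative.
X, y = make_regression(
    n_samples=300, n_features=5, n_informative=2, random_state=0
)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Each fitted tree exposes its own impurity-based importances,
# already normalized to sum to 1 within that tree.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average across the ensemble, then renormalize so the result sums to 1.
averaged = per_tree.mean(axis=0)
averaged = averaged / averaged.sum()

print("manual average:  ", np.round(averaged, 3))
print("forest attribute:", np.round(forest.feature_importances_, 3))
```

The two printed vectors should agree, which is a quick sanity check that the forest-level number is just the per-tree quantity averaged over the ensemble.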

For XGBoost, the `xgboost.plot_importance` function gives the following options for plotting the variable importances:

How the importance is calculated: either “weight”, “gain”, or “cover”. “weight” is the number of times a feature appears in a tree. “gain” is the average gain of splits which use the feature. “cover” is the average coverage of splits which use the feature

where coverage is defined as the number of samples affected by the split.

Note that “gain” is the option most similar to the method I described above.

I hope this helps.

That is a great answer, David! I learned a lot from it. Thank you very much for posting it!

Thanks David! This is really helpful.

Great summary.

I would add that Jeremy’s machine learning course goes into significant detail on this topic in lessons 1-3. If you’re interested in gaining a deeper understanding of random forests and permutation importance, the lectures are well worth watching.