The feature importance is computed the same way in both cases: given a tree, go over all of its internal nodes and do the following (from The Elements of Statistical Learning, p. 368, freely available here):
At each such node t, one of the input variables $X_{v(t)}$ is used to partition the region associated with that node into two subregions; within each a separate constant is fit to the response values. The particular variable chosen is the one that gives maximal estimated improvement $\hat{i}_t^2$ in squared-error risk over that for a constant fit over the entire region. The squared relative importance of variable $X_\ell$ is the sum of such squared improvements over all internal nodes for which it was chosen as the splitting variable.
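In the book's notation, the squared relative importance of variable $X_\ell$ in a single tree $T$ with $J - 1$ internal nodes is

$$
\mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, \mathbb{1}\left(v(t) = \ell\right)
$$

where $\hat{i}_t^2$ is the estimated improvement in squared-error risk at node $t$ and $v(t)$ is the variable split on at that node.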
Thus, both Random Forest and XGBoost generalize this method to ensembles: compute the importance above for each tree, then average across all the trees used in the ensemble.
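In the book's notation again, for an ensemble of $M$ trees the per-tree importances are simply averaged:

$$
\mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m)
$$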
That is the most concise theoretical explanation I can give. A more practical one comes from the docs of the respective libraries. For Random Forest, I recommend you read this cool post from Jeremy and Terence explaining the perils of this technique and why they prefer another mechanism, permutation importance (both are sketched in code after the next quote). There they quote this post on Stack Overflow explaining how the above mechanism is implemented in scikit-learn:
It is sometimes called “gini importance” or “mean decrease impurity” and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.
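scikit-learn exposes that quantity as the fitted model's feature_importances_ attribute, and permutation importance is available from sklearn.inspection. Here is a minimal sketch on made-up data (the dataset and hyperparameters are purely illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Made-up data for illustration: y depends strongly on feature 0, weakly on feature 1.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Mean decrease impurity ("gini importance"): the quantity described in the quote,
# i.e. per-node impurity decreases summed per tree and averaged over the ensemble.
print("MDI:", rf.feature_importances_)

# The forest importance should match (up to normalization) the plain average
# of the per-tree importances, mirroring the averaging step described above.
print("Per-tree average:", np.mean([t.feature_importances_ for t in rf.estimators_], axis=0))

# Permutation importance, the alternative Jeremy and Terence prefer:
# the drop in score when a feature's values are randomly shuffled.
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print("Permutation:", perm.importances_mean)
```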
For XGBoost, the xgboost.plot_importance method gives the following options for plotting the variable importances:
How the importance is calculated: either "weight", "gain", or "cover":
- "weight" is the number of times a feature appears in a tree
- "gain" is the average gain of splits which use the feature
- "cover" is the average coverage of splits which use the feature, where coverage is defined as the number of samples affected by the split
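The same three definitions are also available programmatically via Booster.get_score, so you can print them instead of plotting. A minimal sketch on made-up data (dataset and parameters are again illustrative only):

```python
import numpy as np
import xgboost as xgb

# Made-up data for illustration: y depends mostly on feature f0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y, feature_names=["f0", "f1", "f2", "f3"])
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=20)

# The three importance types quoted from the docs above.
for importance_type in ("weight", "gain", "cover"):
    print(importance_type, booster.get_score(importance_type=importance_type))

# Or plot one of them directly (requires matplotlib):
# xgb.plot_importance(booster, importance_type="gain")
```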
Note that "gain" is the option most similar to the mechanism I described above.
I hope this helps.