Feature importance for trees

(nok) #1

I came across this article, which claims to provide more accurate and consistent feature importance than the permutation method. It seems like an interesting tool for interpreting models.

(edited: Here is my ongoing attempt with the library)



(Jeremy Howard) #2

I’d be interested to hear feedback if anyone tries this. I’ve been meaning to look at it myself for a while, but kinda busy with the fastai library… :slight_smile:

(nok) #3

I am going to read the paper and try it this weekend on the same notebook from the ML course to get a comparison. Hopefully I will have something to share. :slight_smile:

(nok) #4

(You may want to download the HTML; some of the graphs don’t seem to render properly with JS on GitHub.)

I have a quick question.

I used rfpimp to calculate permutation importance, and the importances sum to 1.31 instead of 1. (I’m not sure what the unit of the “importance” is; maybe it is the R² value?) Maybe you have more experience with this library?

Please advise on what else I should experiment with. :slight_smile:
It’s not completely done yet; I am planning to test how SHAP reacts to redundant or highly correlated features, and to try it with LightGBM.

(Jeremy Howard) #5

Thanks for looking at that! @parrt you might be interested. Here’s a link to the nbviewer rendered version of the notebook:

@nok, I suggest you remove year==1000 from the PDP/ICE plots, otherwise they are hard to interpret. For the tree interpreter, it’s best visualized with this: https://github.com/chrispaulca/waterfall

These 2 changes should give you a better comparison.

(nok) #6

Thank you! Good to learn about nbviewer, and thanks for the waterfall chart suggestion. I agree that although the SHAP library comes with fancy JS plots, the waterfall chart is clearer to read.

I struggled with adding an existing plot to a subplot; I could not find a way to put 2 waterfall charts in the same figure. If I already have the fig and ax of a plot, can I simply add them to another plt.subplots()?

(parrt) #7

Computing permutation importance is a matter of measuring the drop in accuracy when you permute one of the feature columns. Because of this, there’s no reason why the sum of these numbers would be meaningful. Naturally we could normalize them to lie between zero and one, but it’s really the relative value that matters, not the actual value.
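To make the point about the sum concrete, here is a minimal sketch of the idea in plain Python. It is not rfpimp's implementation: `toy_predict` is a hypothetical stand-in for a fitted model, and R² is assumed as the accuracy metric.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination (R^2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, X, y, seed=0):
    """Drop in R^2 when each column is shuffled, one column at a time."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    drops = []
    for j in range(len(X[0])):
        shuffled_col = [row[j] for row in X]
        rng.shuffle(shuffled_col)
        # Rebuild every row with only column j replaced by its shuffled value.
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, shuffled_col)]
        drops.append(baseline - r2_score(y, [predict(row) for row in X_perm]))
    return drops

def toy_predict(row):
    # Hypothetical "model": feature 0 matters a lot, feature 1 a little,
    # feature 2 is ignored entirely.
    return 3.0 * row[0] + 0.5 * row[1]

data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(3)] for _ in range(200)]
y = [toy_predict(row) for row in X]

importances = permutation_importance(toy_predict, X, y)
print(importances)       # one R^2 drop per feature
print(sum(importances))  # nothing forces this to equal 1
```

Each number is an independent "how much worse does the score get" measurement, so only the ranking and the ratios between features carry meaning.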

SHAP importance is something I haven’t investigated other than to look at the paper and think, wow, that is a pretty complicated mechanism :wink: Apparently it works well and is likely faster than my highly non-optimized/non-parallel implementation of perm imp.

(nok) #8

Not really. In fact, the SHAP paper seems to put a lot of effort into optimization, because the brute-force computation is O(2^n). In my trial with 1000 sample points, your permutation importance took 8 seconds, while the SHAP value computation took almost 4 minutes.
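For a sense of where the O(2^n) comes from, here is a rough brute-force sketch of exact Shapley values for a single row. It is not how the SHAP library computes them: `toy_predict` is a hypothetical model, and "removing" a feature is approximated by swapping in a fixed baseline value, which is a simplifying assumption.

```python
from itertools import combinations
from math import factorial

def exact_shapley(predict, x, baseline):
    """Exact Shapley values for one row by enumerating every feature subset.

    Features outside a subset take their baseline value (an assumption
    made for this sketch).
    """
    n = len(x)

    def value(subset):
        row = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(row)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for subsets of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_predict(row):
    # Hypothetical model with one additive term and one interaction term.
    return 2.0 * row[0] + row[1] * row[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(toy_predict, x, baseline)
print(phi)
# The contributions add up to the prediction minus the baseline prediction.
print(sum(phi), toy_predict(x) - toy_predict(baseline))
```

The inner loops visit all 2^(n-1) subsets for each of the n features, so the cost doubles with every feature added, which is why the paper needs a tree-specific shortcut.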

I think it is really complicated. That’s why I would rather experiment with it, as I figure I will never understand anything just by reading the math myself…:stuck_out_tongue:

I think you are right. Now I get what the author is talking about in the article https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27: the problem with permutation importance is that you cannot sum up the individual contributions. (Although I suspect that may not be a big problem? Is there some scenario where we really want to sum them up?)

The reason I want to normalize the importances to sum to 1 is to compare them on the same scale. Maybe the relative importance (the ordering) matters more, but the two methods seem to suggest different things in my trial.

(nok) #9

I think you can simply interpret a SHAP value as the change in the output (y), and sum the absolute SHAP values of each feature across rows to get its relative importance.

“Yes you can use the sum of the absolute values as a measure of the relative importance of the…” https://medium.com/@scottmlundberg/yes-you-can-use-the-sum-of-the-absolute-values-as-a-measure-of-the-relative-importance-of-the-131b55315c70
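As a small illustration of that aggregation, the sketch below turns per-row contributions into a global ranking by averaging absolute values per feature, then optionally rescales them to sum to 1. The `contributions` matrix is made-up data for illustration, not output from the SHAP library.

```python
def global_importance(contributions):
    """Mean absolute per-row contribution for each feature (column)."""
    n_rows = len(contributions)
    n_features = len(contributions[0])
    return [
        sum(abs(row[j]) for row in contributions) / n_rows
        for j in range(n_features)
    ]

def normalize(importances):
    """Rescale so the importances sum to 1, for side-by-side comparison."""
    total = sum(importances)
    return [v / total for v in importances]

# Hypothetical per-row contributions: rows are samples, columns are features.
contributions = [
    [2.0, -0.5, 0.0],
    [-1.0, 0.5, 0.0],
    [3.0, -1.0, 0.5],
    [-2.0, 0.0, -0.5],
]

raw = global_importance(contributions)
relative = normalize(raw)
print(raw)
print(relative)
```

Taking absolute values first matters: positive and negative contributions from different rows would otherwise cancel and make an influential feature look unimportant.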

(Jeremy Howard) #10

Not something I’ve ever needed…

(nok) #11

I think adding up contributions makes sense for an individual row; there is less need for it with feature importance.

SHAP feature importance works by adding up the contributions of individual rows. This makes me wonder whether it would make sense to add up the row contributions from the tree interpreter to get a feature importance.

(Jeremy Howard) #12

That’s exactly what the classic gini approach to feature importance does.


(chris) #13


Thank you :smiley:

And thank you to Chris whose waterfall package enabled me to create this chart explaining my prediction of biological age using basic blood chemistry:

(nok) #14

Thanks for the reference! I tried to verify this by adding up the absolute individual contributions from the tree interpreter, and indeed they look very similar (though I do not know why the less important features differ). In my case the model is a regressor, so the feature importance is based on MSE reduction, not Gini impurity. I enjoy knowing these details and connecting these pieces in my mind!