Feature importance for trees

Came across this article; it claims to provide a more accurate and consistent feature importance than the permutation method. Seems like an interesting tool for interpreting models.

(edited: Here is my ongoing attempt with the library)

Reference: http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions
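For context, here is a minimal sketch of how the library is typically used on a tree ensemble. The dataset and model below are placeholders (not the course notebook); only the shap calls themselves reflect the library's standard API.

    import shap
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor

    # Placeholder data and model
    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(X, y)

    explainer = shap.TreeExplainer(model)      # tree-specific algorithm from the paper
    shap_values = explainer.shap_values(X)     # shape: (n_samples, n_features)

    # Each row decomposes one prediction additively:
    # prediction = explainer.expected_value + sum of that row's SHAP values
    row = 0
    print(explainer.expected_value + shap_values[row].sum())
    print(model.predict(X.iloc[[row]]))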


I’ll be interested to hear feedback if anyone tries this. I’ve been meaning to look at it myself for a while, but I’ve been kinda busy with the fastai library… :slight_smile:

I am going to read the paper and give it a try on the same notebook from the ML course to get a comparison over this weekend. Hopefully I will have something to share. :slight_smile:


(You may want to download the HTML; some graphs don’t seem to render properly with JS on GitHub.)

I have a quick question.

I use rfpimp to calculate permutation importance, and the sum of the importances is 1.31 instead of 1 (I’m not sure what the unit of the “importance” is; maybe it’s an R2 value?). Maybe you have more experience with this library?

Please advise what else I should be experimenting with. :slight_smile:
It’s not completely done yet; I’m planning to test how SHAP reacts to redundant/highly correlated features and to LightGBM.

Thanks for looking at that! @parrt, you might be interested. Here’s a link to the nbviewer-rendered version of the notebook:

@nok, I suggest you remove year==1000 from the PDP/ICE plots; otherwise they are hard to interpret. The tree interpreter output is best visualized with this: https://github.com/chrispaulca/waterfall

These 2 changes should give you a better comparison.
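A rough sketch of that suggestion, assuming the waterfall package (installed as waterfallcharts and imported as waterfall_chart, per its README) together with the treeinterpreter library; the dataset and model are placeholders:

    import matplotlib.pyplot as plt
    import waterfall_chart                     # pip install waterfallcharts
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from treeinterpreter import treeinterpreter as ti

    # Placeholder data and model
    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(X, y)

    row = X.iloc[[0]]                          # explain a single observation
    prediction, bias, contributions = ti.predict(model, row.values)

    # Waterfall: start from the bias (mean prediction) and add each feature's contribution
    labels = ["bias"] + list(X.columns)
    values = [bias[0]] + list(contributions[0])
    waterfall_chart.plot(labels, values)
    plt.show()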


Thank you! Good to learn about nbviewer, and thanks for the waterfall chart suggestion. I agree that although the SHAP library comes with fancy JS plots, it’s clearer to read the information off a waterfall chart.

I struggled with adding an existing plot to a subplot; I could not find a way to put two waterfall charts in the same subplot. If I already have the fig and ax of a plot, can I simply add it into another plt.subplots()?
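For what it’s worth, the usual matplotlib pattern works the other way around: an existing Axes can’t easily be moved into another figure after the fact, so you create the grid first and draw into each Axes directly, assuming the plotting function accepts an ax argument (the waterfall package appears to create its own figure, so it may need adapting). A generic sketch:

    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    # Draw directly into each Axes instead of trying to re-parent an existing plot
    ax1.bar(["f1", "f2", "f3"], [0.4, -0.2, 0.3])
    ax1.set_title("row 1 contributions")
    ax2.bar(["f1", "f2", "f3"], [0.1, 0.5, -0.3])
    ax2.set_title("row 2 contributions")

    plt.tight_layout()
    plt.show()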

Computing permutation importance is a matter of measuring the drop in accuracy when you permute one of the feature columns. Because of this, there’s no reason why the sum of these numbers would be meaningful. Naturally we could normalize it to be between zero and one, but it’s really the relative value that matters, not the actual value.
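To make that concrete, here is a minimal sketch of the idea (not the rfpimp implementation itself; the dataset and model are placeholders): shuffle one column at a time on the validation set and record the drop in score.

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)
    model = RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(X_train, y_train)

    baseline = model.score(X_valid, y_valid)   # R2 on the untouched validation set
    rng = np.random.default_rng(42)

    importances = {}
    for col in X_valid.columns:
        X_perm = X_valid.copy()
        X_perm[col] = rng.permutation(X_perm[col].values)           # break this feature's link to y
        importances[col] = baseline - model.score(X_perm, y_valid)  # drop in R2

    for col, drop in sorted(importances.items(), key=lambda kv: -kv[1]):
        print(f"{col:>10s}  {drop:.4f}")

Each number is an independent drop in score, which is why there is no reason for them to sum to 1 (or to anything in particular).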

SHAP importance is something I haven’t investigated other than to look at the paper and think, wow, that is a pretty complicated mechanism :wink: Apparently it works well and is likely faster than my highly non-optimized/non-parallel implementation of permutation importance.

Not really; in fact the SHAP paper seems to put a lot of effort into optimization, since the brute-force way to compute Shapley values is O(2^n). In my trial I used 1000 sample points; with your permutation importance it takes 8 seconds, while the SHAP value computation takes almost 4 minutes.

I think it is really complicated; that’s why I’d rather experiment with it, as I figured I would never understand it just by reading the math myself… :stuck_out_tongue:

I think you are right. Now I get what the author is talking about in the article https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27: the problem with permutation importance is that you cannot sum up the individual contributions. (Although I found it may not be a big problem? Is there some scenario where we really want to sum them up?)

The reason I want to normalize it to 1 is to compare these importances on the same scale. Maybe the relative importance (the ordering) is what matters more; they seem to suggest different things in my trial.

I think you can simply interpret a SHAP value as the change in the output (y), and sum up the absolute SHAP values to get the relative importance.

“Yes you can use the sum of the absolute values as a measure of the relative importance of the…” https://medium.com/@scottmlundberg/yes-you-can-use-the-sum-of-the-absolute-values-as-a-measure-of-the-relative-importance-of-the-131b55315c70
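A minimal sketch of that idea with the shap library (placeholder data and model again): average the absolute SHAP values per column; dividing by the total only rescales it and doesn’t change the ordering.

    import numpy as np
    import pandas as pd
    import shap
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(X, y)

    shap_values = shap.TreeExplainer(model).shap_values(X)    # (n_samples, n_features)

    # Global importance: mean absolute SHAP value per feature, in units of the target
    importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
    print(importance.sort_values(ascending=False))
    print((importance / importance.sum()).sort_values(ascending=False))   # normalized to sum to 1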

Not something I’ve ever needed…

I think adding up the contributions for an individual row makes sense; then there is less need for an overall feature importance.

The way SHAP feature importance works is by adding up individual rows’ contributions. This makes me wonder: does it make sense to add up the row contributions from the tree interpreter to get a feature importance?
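A rough sketch of that question (placeholder data and model; treeinterpreter’s contributions array holds one additive term per row and feature):

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from treeinterpreter import treeinterpreter as ti

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(X, y)

    # contributions has shape (n_samples, n_features)
    prediction, bias, contributions = ti.predict(model, X.values)

    # Aggregate row-level contributions into a global, feature-level importance
    ti_importance = pd.Series(np.abs(contributions).mean(axis=0), index=X.columns)
    print(ti_importance.sort_values(ascending=False))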

That’s exactly what the classic gini approach to feature importance does.

https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#giniimp

THIS IS EXACTLY WHAT I NEEDED RIGHT NOW!

Thank you :smiley:

And thank you to Chris, whose waterfall package enabled me to create this chart explaining my prediction of biological age using basic blood chemistry:


Thanks for the reference! I tried to verify this by adding up the absolute individual contributions from the tree interpreter, and indeed they look very similar (though I do not know why the less important features differ). In my case it is a regressor, so the feature importance is MSE reduction rather than Gini impurity. I enjoy learning these details and connecting the pieces in my mind!

We have used SHAP at work to get an idea of why certain predictions are happening. It has been great. There are times when it isn’t perfect (especially as a summary), but it is pretty solid and gives us better insight into our predictions. It has given us an explanation when somebody doesn’t understand a prediction.


Do you mean that, in your experience, feature importance is less useful than individual row explanations? It runs pretty slowly in my experience; could you share more about how you use it in your workflow?

Feature importance is really useful for selling our models to business stakeholders, but they don’t trust it until you can give them individual stories. Usually these match pretty well with what you would expect, but every once in a while you will have one that looks like a total outlier until you look at the SHAP predictions. What we generated was built (very hackily) from the shap.force_plot code. Basically we just looked inside that to see where the values were coming from and extracted them from there. It definitely wasn’t straightforward and was quite slow, but that isn’t a huge deal for what we are doing, because we only run our models once per month at the moment.

Once we get all of these into text form, we have a list of how every feature changes every prediction. At that point, we move to a different tool for data manipulation (Alteryx; really good, but pricey). This could definitely be done in Python as well, but for us it is easy enough to move over and just do some ETL on it to get it formatted exactly how we want.

Here is what I used to generate the code we use: https://github.com/slundberg/shap/blob/e9c750a45fc027227c87dbfbe34ccc6143632aa1/shap/plots/force.py

Specifically, these lines were really useful:

    # Snippet from shap/plots/force.py (link above). The surrounding force_plot code
    # defines shap_values, base_value, features, feature_names, link and out_names;
    # Instance, AdditiveExplanation, Model and DenseData are helper classes imported
    # at the top of that file.
    instance = Instance(np.zeros((1, len(feature_names))), features)
    e = AdditiveExplanation(
        base_value,                              # the model's expected (base) value
        np.sum(shap_values[0, :]) + base_value,  # the actual output for this row
        shap_values[0, :],                       # per-feature SHAP contributions
        None,
        instance,
        link,
        Model(None, out_names),
        DenseData(np.zeros((1, len(feature_names))), list(feature_names))
    )

The e object ends up containing all of the numbers you need. As far as I could tell, there wasn’t a good way to do this without making it a little hacky.

Thanks for your generous sharing. I tried to hack into the e instance, but I am not sure what I am looking for.

“Once we get all of these into text form, we have a list of how every feature changes every prediction.”

Isn’t this just the SHAP values?

What are the attributes here that you need to hack out of the force_plot code?
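For what it’s worth, a minimal sketch of one way to get that “text form” without digging into force_plot, assuming the standard TreeExplainer API (placeholder data and model): the matrix returned by shap_values() is already one additive contribution per row and feature.

    import pandas as pd
    import shap
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)     # (n_samples, n_features)

    # One row per prediction, one column per feature: how much each feature moved
    # that prediction away from the base value
    contrib = pd.DataFrame(shap_values, columns=X.columns, index=X.index)
    contrib["prediction"] = model.predict(X)
    print("base value:", explainer.expected_value)
    contrib.to_csv("per_prediction_contributions.csv")   # ready for downstream ETL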

We have also used Lundberg’s SHAP library to produce prediction-level interpretations for client projects. Clients love having case-level results.


I have used it for tree-based models as well; it is quite nice. Has anyone tried to use it for more complicated NN models?