Feature importance in deep learning

You can look at my partial dependence implementation for fastai2: https://github.com/Pak911/fastai2-tabular-interpretation

5 Likes

Terence Parr just released a paper, "Nonparametric Feature Impact and Importance" (https://arxiv.org/pdf/2006.04750.pdf), that estimates feature importance and impact. Python package at github.com/parrt/stratx.

2 Likes

Hi Pak, great post!
Can't wait to try it out when I build a tabular model.
Cheers, mrfabulous1 :smiley: :smiley:

It seems that a lot of people simply disregard the required assumption of feature independence. What are the risks? How does multicollinearity affect the feature importance scores? If I know that a few of my features are correlated, is there a way I can interpret the feature importance scores better because of that knowledge? I.e., will only the correlated features have incorrect scores, or will all features have incorrect scores in the presence of multicollinearity among some subset of the features?

These are general questions, but I'm using SHAP specifically. For further details, or if you want to contribute to the discussion over at the SHAP GitHub, see my post there.

1 Like

My $0.02: through folks who have domain expertise, we've seen that the feature importance results make sense. For instance, if I were looking at salary I would expect job title to be a very important feature, which is what we've seen on the adults dataset.

In regards to multicollinearity, you could do permutation importance over all n×n feature pairs, but that is a giant rabbit hole. Along with this, generally what I have seen is that if features are indeed tightly correlated, I'll notice a stark change in the relative feature importance once removed, and in my work they tend to be grouped together in the rankings (i.e., very similar in rank).
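
To make this concrete, here is a minimal sketch of permutation importance with two tightly correlated features. Everything in it is made up for illustration (synthetic data, a scikit-learn RandomForestRegressor standing in for a tabular neural net, R² as the metric), so treat it as a toy rather than the adults example above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# x1 and x2 are nearly duplicates; x3 is independent
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)          # corr(x1, x2) ~ 0.995
x3 = rng.normal(size=n)
y = 2 * x1 + x3 + 0.1 * rng.normal(size=n)  # the target only "uses" x1 and x3

X = np.column_stack([x1, x2, x3])
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
baseline = r2_score(y_valid, model.predict(X_valid))

# Permutation importance: shuffle one column at a time, measure the drop in the metric
for i, name in enumerate(["x1", "x2", "x3"]):
    X_perm = X_valid.copy()
    X_perm[:, i] = rng.permutation(X_perm[:, i])
    drop = baseline - r2_score(y_valid, model.predict(X_perm))
    print(f"{name}: drop in R^2 = {drop:.3f}")

# Because x1 and x2 carry almost the same information, the model can fall back on the
# other copy when one is shuffled, so their individual drops understate how much the
# pair matters together.
```

One way around that credit splitting is to permute the correlated columns together as a group and read the drop as their joint importance.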

I've also opened this question up to the twitter-verse to see what it spits back. I'll update this post with anything interesting people say :smiley:

Update 1: I was given three papers about the shapley library (different from SHAP, apparently!), @jaxondk:

4 Likes

Thanks for the quick response! And for reaching out to the twitter-verse :wink:

When you say:

I'll notice a stark change in the relative feature importance once removed

what do you mean? Are you saying that while performing permutation importance, if a feature gets removed that is highly correlated with another feature, it will get a large feature importance score? I'm not sure what you mean by "once removed".

Update 1: Yes, Shapley values and SHAP values are somewhat different. SHAP is a library with optimized approximations of Shapley values for various types of models, one of which is neural nets.

PS - what is your twitter handle?

1 Like

So say workclass and living location are two variables, and (for the sake of the exercise) they're highly correlated. My expectation would be that during feature importance, if I were to shuffle living location the overall % would decrease, and vice versa: once workclass was shuffled, I would (presumably) see it shift by the same % or so. (This may be naivety too :slight_smile: )

Twitter handle is @TheZachMueller

OK, so your expectation (which you say may or may not be naive) is that correlated features will have similar global feature importance scores (or rankings).

Unfortunately, I don't seem to be seeing that in my case for all of my models. I have 3 features that are >= 0.96 correlated (Pearson coefficient) with each other.

For one of the models, they apparently aren't used at all. But in another model they are, though with varying importances. In their global scores (using SHAP values from DeepSHAP), one of them is ranked high (2nd highest) and the other two are fairly low.
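
(For anyone wanting to reproduce this kind of global ranking, a rough sketch is below: per-feature mean absolute SHAP value from shap.DeepExplainer, then sort. The network and tensors are placeholders rather than my actual model/data, and the exact shape that shap_values comes back in can vary between shap versions.)

```python
import numpy as np
import shap
import torch
import torch.nn as nn

# Placeholder network and data standing in for the real model/dataset
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1)).eval()
X_background = torch.randn(100, 10)   # background sample the explainer integrates over
X_explain = torch.randn(500, 10)      # rows to explain

explainer = shap.DeepExplainer(model, X_background)
shap_values = explainer.shap_values(X_explain)

# Depending on the shap version this is a list (one array per model output) or an
# array with a trailing output dimension; squeeze it down to (n_rows, n_features).
if isinstance(shap_values, list):
    shap_values = shap_values[0]
shap_values = np.squeeze(np.asarray(shap_values))

# Global importance: mean absolute SHAP value per feature, highest first
global_importance = np.abs(shap_values).mean(axis=0)
for i in np.argsort(global_importance)[::-1]:
    print(f"feature {i}: {global_importance[i]:.4f}")
```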

This is concerning to me and I'm not sure how to work around the required assumption of feature independence. Not sure if I can trust the rankings for any of them, or if I just can't trust the ones that are actually correlated.

I'm just skimming through some of those papers. "True to the Model or True to the Data" focuses on local interpretations, and right now at least I'm more concerned with global explanations. But that paper does link to "Understanding Global Feature Contributions", which supposedly introduces a new method called SAGE that can robustly handle correlated features and complex feature interactions.

I'll continue to update this with things I learn.

1 Like

Here are my notes on just a couple of these papers, as well as some very useful threads/comments in various places. Please realize that I skimmed these papers fairly quickly; I also don't have a PhD and am fairly removed from academia, so I don't read massive amounts of papers on a regular basis. All that being said, I could totally be misinterpreting things and would love for someone to correct me on anything here :slight_smile: I feel fairly confident about the first one, and honestly barely looked at the second, but included a small blurb about it anyway. I could be oversimplifying things in my conclusion section, so please provide some feedback so I don't mislead the community or myself :laughing:

True to the Model or True to the Data

Discusses observational vs. interventional conditional expectations when doing feature importance, and how neither is preferred in general; rather, the choice is application-specific and depends on whether you want to be "true to the model" or "true to the data". Things like permutation importance and the SHAP approximations in DeepSHAP are interventional (it seems Lundberg, the author of shap, agrees), or "true to the model".

The paper states that if you have independent features, importance values are the same for observational and interventional. Otherwise, "correlation splits the Beta as credit between correlated variables and higher levels of correlation leads to slower convergence of the observational Shapley value estimates." However, I don't know if this means the credit is split evenly between the correlated features; if it isn't, one correlated feature may look disproportionately more important than the features it's correlated with. This splitting only occurs in the observational case, because the interventional approach only shows you what the model is actually using: if the model only uses one of the correlated features, credit won't be split with the others.
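
To sanity-check my reading of that claim, here is a small toy calculation of my own (not from the paper): two jointly Gaussian features with correlation 0.95, a "model" that only ever reads x1, and the exact two-feature Shapley formula with the expectations estimated by Monte Carlo. The interventional value function gives all the credit to x1, while the observational one lets credit bleed to x2 even though the model never touches it:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.95          # correlation between x1 and x2
n_mc = 200_000      # Monte Carlo samples for the expectations

def f(x1, x2):
    return x1       # the model only ever uses x1

# The instance we want to explain
x1_0, x2_0 = 1.0, 1.0

# Samples from the joint distribution: standard bivariate Gaussian with correlation rho
x1_s = rng.normal(size=n_mc)
x2_s = rho * x1_s + np.sqrt(1 - rho**2) * rng.normal(size=n_mc)

# --- Interventional value function: missing features drawn from their MARGINAL ---
v_empty = np.mean(f(x1_s, x2_s))        # nothing known
v_1     = np.mean(f(x1_0, x2_s))        # x1 known (same under both value functions,
                                        # because f never reads x2)
v_2     = np.mean(f(x1_s, x2_0))        # x2 known, x1 from its marginal (correlation ignored)
v_12    = f(x1_0, x2_0)
phi1_int = 0.5 * (v_1 - v_empty) + 0.5 * (v_12 - v_2)
phi2_int = 0.5 * (v_2 - v_empty) + 0.5 * (v_12 - v_1)

# --- Observational value function: missing features drawn CONDITIONAL on the known ones ---
x1_given_x2 = rng.normal(loc=rho * x2_0, scale=np.sqrt(1 - rho**2), size=n_mc)
v_2_obs = np.mean(f(x1_given_x2, x2_0))  # x2 known, x1 | x2 respects the correlation
phi1_obs = 0.5 * (v_1 - v_empty) + 0.5 * (v_12 - v_2_obs)
phi2_obs = 0.5 * (v_2_obs - v_empty) + 0.5 * (v_12 - v_1)

print(f"interventional: phi1={phi1_int:.2f}, phi2={phi2_int:.2f}")  # = 1.0 and 0.0 (up to MC noise)
print(f"observational:  phi1={phi1_obs:.2f}, phi2={phi2_obs:.2f}")  # credit splits, ~0.525 and ~0.475
```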

Conclusion: Essentially, "true to the model" = interventional and "true to the data" = observational. "Being true to the model is the best choice for most applications of explainable AI, where the goal is to explain the model itself." However, if you are "focused on scientific discovery", you will likely want to be true to the data.
The authors note that in an ideal world, you can reparameterize your model to get at the underlying independent factors. Then you can use interventional techniques without getting off the data manifold.

Understanding Global Feature Contributions

I ended up not spending too much time on this one. I was excited about it, as it seemed to be a method for handling correlated features when doing global feature importance, but then read this:

In practice we sample from the marginal distribution, which corresponds to an assumption of feature independence

I had thought this paper was about dealing with feature dependence. I stopped reading at this point, though to be honest I'm sure I'm missing the point on this one.

Scouring various threads

  • "SHAP aligns with causal interventional perturbations" (Lundberg). Honestly, that whole thread was very useful.
  • In the presence of correlated features, you cannot be both true to the data and true to the model. Either you must provide some data as inputs that are off-manifold (if you use interventional), or you must allow credit to bleed between correlated features (if you use observational). This is also the main gist of the "true to the model / true to the data" paper. (Lundberg)
  • This comment discusses some of the differences between observational and interventional, and also provided links to some of these other threads.

The Takeaway

This was really helpful for me. I would suspect that most practitioners here are more concerned about being true to the model. If that's the case, it seems that things like permutation importance or SHAP values will work for you, even in the presence of correlated features. You will just see what the model used, regardless of multicollinearity.

For those more interested in scientific discovery, it looks like observational Shapley values are what you want. Currently, I'm unaware of a package for calculating these for neural nets; maybe SHAP optionally has this functionality and I'm just unaware of it. I would be very interested in learning about a package to compute these, as I am involved in both simply explaining a model and scientific discovery.

5 Likes