Feature importance in deep learning

You can look at my partial dependence implementation for fastai2: https://github.com/Pak911/fastai2-tabular-interpretation

5 Likes

Terence Parr just released a paper, "Nonparametric Feature Impact and Importance" (https://arxiv.org/pdf/2006.04750.pdf), that estimates feature importance and impact. Python package at github.com/parrt/stratx.

2 Likes

Hi Pak, great post!
Can't wait to try it out when I build a tabular model.
Cheers, mrfabulous1 :smiley: :smiley:

It seems that a lot of people simply disregard the required assumption of feature independence. What are the risks? How does multicollinearity affect the feature importance scores? If I know that a few of my features are correlated, is there a way I can interpret the feature importance scores better because of that knowledge? I.e., will only the correlated features have incorrect scores, or will all features have incorrect scores in the presence of multicollinearity among some subset of the features?

These are general questions, but I'm using SHAP specifically. For further details, or if you want to contribute to the discussion over at the SHAP GitHub, see my post there.

1 Like

My $0.02: through folks who have domain expertise, we've seen that the feature importance results make sense. For instance, if I were looking at salary I would expect job title to be a very important feature, which is what we've seen on the adults dataset.

In regards to multicollinearity, you could do permutation importance over all n×n feature pairs, but that is a giant rabbit hole. Along with this, generally what I have seen is that if features are indeed tightly correlated, I'll notice a stark change in the relative feature importance once removed, and in my work they tend to be grouped together in the rankings (i.e., very similar in rank).
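
To make this concrete, here is a minimal sketch of permutation importance with two tightly correlated features. Everything in it is made up for illustration (synthetic data, a scikit-learn RandomForestRegressor standing in for a tabular neural net, R² as the metric), so treat it as a toy rather than the adults example above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# x1 and x2 are nearly duplicates; x3 is independent
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)          # corr(x1, x2) ~ 0.995
x3 = rng.normal(size=n)
y = 2 * x1 + x3 + 0.1 * rng.normal(size=n)  # the target only "uses" x1 and x3

X = np.column_stack([x1, x2, x3])
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
baseline = r2_score(y_valid, model.predict(X_valid))

# Permutation importance: shuffle one column at a time, measure the drop in the metric
for i, name in enumerate(["x1", "x2", "x3"]):
    X_perm = X_valid.copy()
    X_perm[:, i] = rng.permutation(X_perm[:, i])
    drop = baseline - r2_score(y_valid, model.predict(X_perm))
    print(f"{name}: drop in R^2 = {drop:.3f}")

# Because x1 and x2 carry almost the same information, the model can fall back on the
# other copy when one is shuffled, so their individual drops understate how much the
# pair matters together.
```

One way around that credit splitting is to permute the correlated columns together as a group and read the drop as their joint importance.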

I've also opened this question up to the twitter-verse to see what it spits back. I'll update this post with anything interesting people say :smiley:

Update 1: I was given three papers about the shapley library (different from SHAP, apparently!), @jaxondk:

4 Likes

Thanks for the quick response! And for reaching out to the twitter-verse :wink:

When you say:

I'll notice a stark change in the relative feature importance once removed

what do you mean? Are you saying that while performing permutation importance, if a feature gets removed that is highly correlated with another feature, it will get a large feature importance score? I'm not sure what you mean by "once removed".

Update 1: Yes, Shapley values and SHAP values are somewhat different. SHAP is a library with optimized approximations of Shapley values for various types of models, one of which is neural nets.

PS - what is your twitter handle?

1 Like

So say workclass and living location are two variables, and (for the sake of the exercise) they're highly correlated. My expectation would be that during feature importance, if I were to shuffle living location the overall % would decrease, and vice versa: once workclass was shuffled, I would (presumably) see it shift by the same % or so. (This may be naivety too :slight_smile: )

Twitter handle is @TheZachMueller

OK, so your expectation (which you say may or may not be naive) is that correlated features will have similar global feature importance scores (or rankings).

Unfortunately, I don't seem to be seeing that in my case for all of my models. I have 3 features that are >= 0.96 correlated (Pearson coefficient) with each other.

For one of the models, they apparently aren't used at all. But in another model they are, though with varying importances. In their global scores (using SHAP values from DeepSHAP), one of them is ranked high (2nd highest) and the other two are fairly low.
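
(For anyone wanting to reproduce this kind of global ranking, a rough sketch is below: per-feature mean absolute SHAP value from shap.DeepExplainer, then sort. The network and tensors are placeholders rather than my actual model/data, and the exact shape that shap_values comes back in can vary between shap versions.)

```python
import numpy as np
import shap
import torch
import torch.nn as nn

# Placeholder network and data standing in for the real model/dataset
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1)).eval()
X_background = torch.randn(100, 10)   # background sample the explainer integrates over
X_explain = torch.randn(500, 10)      # rows to explain

explainer = shap.DeepExplainer(model, X_background)
shap_values = explainer.shap_values(X_explain)

# Depending on the shap version this is a list (one array per model output) or an
# array with a trailing output dimension; squeeze it down to (n_rows, n_features).
if isinstance(shap_values, list):
    shap_values = shap_values[0]
shap_values = np.squeeze(np.asarray(shap_values))

# Global importance: mean absolute SHAP value per feature, highest first
global_importance = np.abs(shap_values).mean(axis=0)
for i in np.argsort(global_importance)[::-1]:
    print(f"feature {i}: {global_importance[i]:.4f}")
```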

This is concerning to me and I'm not sure how to work around the required assumption of feature independence. Not sure if I can trust the rankings for any of them, or if I just can't trust the ones that are actually correlated.

I'm just skimming through some of those papers. "True to the Model or True to the Data" focuses on local interpretations, and right now at least I'm more concerned with global explanations. But that paper does link to "Understanding Global Feature Contributions", which supposedly introduces a new method called SAGE that can robustly handle correlated features and complex feature interactions.

I'll continue to update this with things I learn.

1 Like

Here are my notes on just a couple of these papers, as well as some very useful threads/comments in various places. Please realize that I skimmed these papers fairly quickly; I also don't have a PhD and am fairly removed from academia, so I don't read massive amounts of papers on a regular basis. All that being said, I could totally be misinterpreting things and would love for someone to correct me on anything here :slight_smile: I feel fairly confident about the first one, and honestly barely looked at the second, but included a small blurb about it anyway. I could be oversimplifying things in my conclusion section, so please provide some feedback so I don't mislead the community or myself :laughing:

True to the Model or True to the Data

Discusses observational vs. interventional conditional expectations when doing feature importance, and how neither is preferred in general; rather, the choice is application-specific and depends on whether you want to be "true to the model" or "true to the data". Things like permutation importance and the SHAP approximations in DeepSHAP are interventional (it seems Lundberg, the author of shap, agrees), or "true to the model".

The paper states that if you have independent features, importance values are the same for observational and interventional. Otherwise, "correlation splits the Beta as credit between correlated variables and higher levels of correlation leads to slower convergence of the observational Shapley value estimates." However, I don't know if this means the credit is split evenly between the correlated features; if it isn't, one correlated feature may look disproportionately more important than the features it's correlated with. This splitting only occurs in the observational case, because the interventional approach only shows you what the model is actually using: if the model only uses one of the correlated features, credit won't be split with the others.
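
To sanity-check my reading of that claim, here is a small toy calculation of my own (not from the paper): two jointly Gaussian features with correlation 0.95, a "model" that only ever reads x1, and the exact two-feature Shapley formula with the expectations estimated by Monte Carlo. The interventional value function gives all the credit to x1, while the observational one lets credit bleed to x2 even though the model never touches it:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.95          # correlation between x1 and x2
n_mc = 200_000      # Monte Carlo samples for the expectations

def f(x1, x2):
    return x1       # the model only ever uses x1

# The instance we want to explain
x1_0, x2_0 = 1.0, 1.0

# Samples from the joint distribution: standard bivariate Gaussian with correlation rho
x1_s = rng.normal(size=n_mc)
x2_s = rho * x1_s + np.sqrt(1 - rho**2) * rng.normal(size=n_mc)

# --- Interventional value function: missing features drawn from their MARGINAL ---
v_empty = np.mean(f(x1_s, x2_s))        # nothing known
v_1     = np.mean(f(x1_0, x2_s))        # x1 known (same under both value functions,
                                        # because f never reads x2)
v_2     = np.mean(f(x1_s, x2_0))        # x2 known, x1 from its marginal (correlation ignored)
v_12    = f(x1_0, x2_0)
phi1_int = 0.5 * (v_1 - v_empty) + 0.5 * (v_12 - v_2)
phi2_int = 0.5 * (v_2 - v_empty) + 0.5 * (v_12 - v_1)

# --- Observational value function: missing features drawn CONDITIONAL on the known ones ---
x1_given_x2 = rng.normal(loc=rho * x2_0, scale=np.sqrt(1 - rho**2), size=n_mc)
v_2_obs = np.mean(f(x1_given_x2, x2_0))  # x2 known, x1 | x2 respects the correlation
phi1_obs = 0.5 * (v_1 - v_empty) + 0.5 * (v_12 - v_2_obs)
phi2_obs = 0.5 * (v_2_obs - v_empty) + 0.5 * (v_12 - v_1)

print(f"interventional: phi1={phi1_int:.2f}, phi2={phi2_int:.2f}")  # = 1.0 and 0.0 (up to MC noise)
print(f"observational:  phi1={phi1_obs:.2f}, phi2={phi2_obs:.2f}")  # credit splits, ~0.525 and ~0.475
```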

Conclusion: Essentially, "true to the model" = interventional and "true to the data" = observational. "Being true to the model is the best choice for most applications of explainable AI, where the goal is to explain the model itself." However, if you are "focused on scientific discovery", you will likely want to be true to the data.
The authors note that in an ideal world, you can reparameterize your model to get at the underlying independent factors. Then you can use interventional techniques without getting off the data manifold.

Understanding Global Feature Contributions

I ended up not spending too much time on this one. I was excited about it, as it seemed to be a method for handling correlated features when doing global feature importance, but then read this:

In practice we sample from the marginal distribution, which corresponds to an assumption of feature independence

I had thought this paper was about dealing with feature dependence. I stopped reading at this point, though to be honest I'm sure I'm missing the point on this one.

Scouring various threads

  • "SHAP aligns with causal interventional perturbations" (Lundberg). Honestly, that whole thread was very useful.
  • In the presence of correlated features, you cannot be both true to the data and true to the model. Either you must provide some data as inputs that are off-manifold (if you use interventional), or you must allow credit to bleed between correlated features (if you use observational). This is also the main gist of the "true to the model / true to the data" paper. (Lundberg)
  • This comment discusses some of the differences between observational and interventional, and also provided links to some of these other threads.

The Takeaway

This was really helpful for me. I would suspect that most practitioners here are more concerned about being true to the model. If that's the case, it seems that things like permutation importance or SHAP values will work for you, even in the presence of correlated features. You will just see what the model used, regardless of multicollinearity.

For those more interested in scientific discovery, it looks like observational Shapley values are what you want. Currently, I'm unaware of a package for calculating these for neural nets; maybe SHAP optionally has this functionality and I'm just unaware of it. I would be very interested in learning about a package to compute these, as I am involved in both simply explaining a model and scientific discovery.

5 Likes