You can look at my partial dependence implementation for fastai2: https://github.com/Pak911/fastai2-tabular-interpretation
Terence Parr just released a paper, "Nonparametric Feature Impact and Importance" (https://arxiv.org/pdf/2006.04750.pdf), that estimates feature importance and impact. Python package at github.com/parrt/stratx.
Hi Pak, great post.
Can't wait to try it out when I build a tabular model!
Cheers, mrfabulous1
It seems that a lot of people simply disregard the required assumption of feature independence. What are the risks? How does multicollinearity affect the feature importance scores? If I know that a few of my features are correlated, is there a way I can use that knowledge to interpret the feature importance scores better? I.e., will only the correlated features have incorrect scores, or will all features have incorrect scores when some subset of the features is multicollinear?
These are general questions, but I'm using SHAP specifically. For further details, or if you want to contribute to the discussion over at the SHAP GitHub, see my post there.
My $0.02: through folks with domain expertise, we've seen that the feature importance results make sense. For instance, if I were looking at salary I would expect job title to be a very important feature, which is what we've seen on the Adult dataset.
In regards to multicollinearity, you could probe it by running permutation importance on n×n feature pairs, but that is a giant rabbit hole (a rough sketch of what I mean is below). Along with this, what I have generally seen is that if features are indeed tightly correlated, I'll notice a stark change in the relative feature importance once removed, and in my work they tend to be grouped together in the rankings (i.e., very similar in rank).
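For concreteness, here is a rough sketch of the pairwise idea, nothing from fastai itself: shuffle two columns at a time with the same permutation, so shared signal shows up as a joint drop in score. `model`, `X`, and `y` are placeholders for a fitted estimator with a `.score` method, a validation DataFrame, and its target.

```python
import itertools
import numpy as np
import pandas as pd

def pairwise_permutation_importance(model, X, y, n_repeats=5, random_state=0):
    """Score drop when each pair of columns is shuffled with the same permutation."""
    rng = np.random.default_rng(random_state)
    baseline = model.score(X, y)
    drops = {}
    for f1, f2 in itertools.combinations(X.columns, 2):
        scores = []
        for _ in range(n_repeats):
            perm = rng.permutation(len(X))
            X_perm = X.copy()
            # one permutation for both columns: their mutual correlation is kept,
            # but their joint relationship to the target is broken
            X_perm[f1] = X[f1].to_numpy()[perm]
            X_perm[f2] = X[f2].to_numpy()[perm]
            scores.append(model.score(X_perm, y))
        drops[(f1, f2)] = baseline - np.mean(scores)
    return pd.Series(drops).sort_values(ascending=False)
```

Comparing a pair's joint drop with the two single-feature drops gives a rough sense of how much signal the pair shares.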
I've also opened this question up to the Twitter-verse to see what comes back. I'll update this post with anything interesting they say.
Update 1: I was given 3 papers about the shapley library (different from SHAP, apparently!) @jaxondk:
Thanks for the quick response! And for reaching out to the Twitter-verse.
When you say:
"I'll notice a stark change in the relative feature importance once removed"
what do you mean? Are you saying that while performing permutation importance, if a feature that is highly correlated with another feature gets removed, it will get a large feature importance score? Not sure what you mean by "once removed".
Update 1: Yes, Shapley values and SHAP values are somewhat different. SHAP is a library that has some optimized approximations of Shapley values for various types of models, one of which is neural nets.
PS: what is your Twitter handle?
So say workclass and living location are two variables, and (for the sake of the exercise) they're highly correlated. My expectation during feature importance would be that if I shuffled living location the overall % would decrease, and vice versa: once workclass was shuffled, I would (presumably) see it shift by about the same %. (This may be naivety on my part too.)
Twitter handle is @TheZachMueller
OK, so your expectation (which you say may or may not be naive) is that correlated features will have similar global feature importance scores (or rankings).
Unfortunately, I don't seem to be seeing that in my case for all of my models. I have 3 features that are >= 0.96 correlated (Pearson coefficient) with each other.
For one of the models, they apparently aren't used at all. In another model they are used, but with varying importances. In their global scores (using SHAP values from DeepSHAP), one of them is ranked high (2nd highest) and the other 2 are fairly low.
This is concerning to me, and I'm not sure how to work around the required assumption of feature independence. I don't know if I can trust the rankings for any of the features, or if I just can't trust the ones that are actually correlated.
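One small thing that might help (my own suggestion, not something from SHAP): list the highly correlated pairs up front with a plain pandas correlation matrix, so you at least know which importance scores to treat with suspicion. `df` and the 0.96 cutoff are placeholders for your feature DataFrame and threshold.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.96) -> pd.Series:
    """Feature pairs whose absolute Pearson correlation is at or above `threshold`."""
    corr = df.corr(method="pearson").abs()
    # keep only the upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)
```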
I'm just skimming through some of those papers. "True to the Model or True to the Data" focuses on local interpretations, and right now at least I'm more concerned with global explanations. But that paper does link to "Understanding Global Feature Contributions", which supposedly introduces a new method called SAGE that can robustly handle correlated features and complex feature interactions.
I'll continue to update with things I learn.
Here are my notes on just a couple of these papers, as well as some very useful threads/comments in various locations. Please realize that I skimmed these papers fairly quickly, and I also don't have a PhD and am fairly removed from academia, so I don't read massive amounts of papers on a regular basis. All that being said, I could totally be misinterpreting things and would love for someone to correct me on anything here. I feel fairly confident about the first one, and honestly barely looked at the second, but included a small blurb about it anyway. I could be oversimplifying things in my conclusion section, so please provide some feedback so I don't mislead the community or myself.
True to the Model or True to the Data
Discusses observational vs. interventional conditional expectations when doing feature importance, and how neither is preferred in general; rather, the choice is application-specific and depends on whether you want to be "true to the model" or "true to the data". Things like permutation importance and the SHAP approximations in DeepSHAP are interventional (Lundberg, the author of shap, seems to agree), or "true to the model".
The paper states that if you have independent features, importance values are the same for observational and interventional. Otherwise, "correlation splits the Beta as credit between correlated variables and higher levels of correlation leads to slower convergence of the observational Shapley value estimates." However, I don't know if this means the credit is split evenly between the correlated features. If it isn't split evenly, then one correlated feature may look disproportionately more important than the features it's correlated with. This splitting only occurs in the observational case, because the interventional approach only shows you what the model is actually using: if the model only uses one of the correlated features, credit won't be split across the others.
Conclusion: essentially, "true to the model" = interventional, "true to the data" = observational. "Being true to the model is the best choice for most applications of explainable AI, where the goal is to explain the model itself." However, if you are "focused on scientific discovery", you will likely want to be true to the data.
The authors note that in an ideal world, you can reparameterize your model to get at the underlying independent factors. Then you can use interventional techniques without getting off the data manifold.
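If you happen to have a tree model alongside the neural net, the shap library makes the two flavours easy to compare via TreeExplainer's feature_perturbation argument; as I understand it, "interventional" with a background dataset is the "true to the model" option, while "tree_path_dependent" is closer in spirit to the observational one. A minimal sketch on a toy scikit-learn model (not my actual setup):

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(as_frame=True, return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
background = X.sample(100, random_state=0)

# "true to the model": interventional expectations against a background dataset
explainer_int = shap.TreeExplainer(
    model, data=background, feature_perturbation="interventional"
)
shap_int = explainer_int.shap_values(X)

# "tree_path_dependent" conditions on the trees' own cover statistics instead,
# which is usually described as closer to the observational flavour
explainer_obs = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")
shap_obs = explainer_obs.shap_values(X)
```

Comparing the two sets of values on your correlated features is a quick way to see how much the credit-splitting behaviour described above actually matters in practice.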
Understanding Global Feature Contributions
I ended up not spending too much time on this one. I was excited about it, as it seemed to be a method for handling correlated features when doing global feature importance, but then read this:
"In practice we sample from the marginal distribution, which corresponds to an assumption of feature independence"
I had thought this paper was about dealing with feature dependence. I stopped reading at this point, though to be honest I'm sure I'm missing the point on this one.
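To make the quoted caveat concrete (my own toy example, nothing from the paper): drawing each feature from its own marginal silently destroys whatever correlation structure the data had, which is exactly the independence assumption being flagged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.98 * x1 + 0.02 * rng.normal(size=n)  # x1 and x2 are highly correlated
X = np.column_stack([x1, x2])

rows = X[rng.integers(0, n, size=n)]  # resample whole rows: joint distribution kept
marginal = np.column_stack(
    [X[rng.integers(0, n, size=n), j] for j in range(X.shape[1])]
)  # resample each column independently: only the marginals are kept

print(np.corrcoef(rows.T)[0, 1])      # ~1, correlation preserved
print(np.corrcoef(marginal.T)[0, 1])  # ~0, correlation destroyed
```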
Scouring various threads
- "SHAP aligns with causal interventional perturbations" (Lundberg). Honestly, that whole thread was very useful.
- In the presence of correlated features, you cannot be both true to the data and true to the model. Either you must provide some data as inputs that are off-manifold (if you use interventional), or you must allow credit to bleed between correlated features (if you use observational). This is also the main gist of the "true to the model / true to the data" paper. (Lundberg)
- This comment discusses some of the differences between observational and interventional, and also provided links to some of these other threads.
The Takeaway
This was really helpful for me. I would suspect that most practitioners here are more concerned about being true to the model. If that's the case, it seems that things like permutation importance or SHAP values will work for you, even in the presence of correlated features; you will just see what the model used, regardless of multicollinearity.
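For that "true to the model" case, scikit-learn's permutation_importance is probably the lowest-friction way to get started; a minimal sketch on a toy dataset (swap in your own model and validation split):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(as_frame=True, return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# mean score drop over 10 shuffles of each feature on the validation set
result = permutation_importance(model, X_valid, y_valid, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:10]:
    print(f"{X_valid.columns[i]:<25} "
          f"{result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```

Just keep in mind the caveat from the threads above: with highly correlated features the shuffled rows are off-manifold, which is exactly the "true to the model" trade-off.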
For those more interested in scientific discovery, it looks like observational Shapley values are what you want. Currently, I'm unaware of a package for calculating these for neural nets; maybe SHAP has this as an option I've missed. I would be very interested in learning about a package to compute these, as I am involved in both simply explaining a model and scientific discovery.