Feature importance in deep learning

Thank you both, this has helped me understand why TabNet is such a big deal (and given me a better understanding of the bigger issues with an FCNN for tabular interpretability). I appreciate the thorough thought put into both; it’s given me much to think about :slight_smile:

Here is an idea I suggested in the other thread:

Given a trained model f(x0, ..., xn), we might be able to learn the feature importances by simulating attention.

For each feature xi, we can associate an uninformative value mi (the mean of the feature, or the mean of its embedding). We can then create a dampened feature xi' = ai*xi + (1-ai)*mi, with the sum of all ai equal to 1 (this can be enforced by a softmax).

We now maximize the accuracy of f(x0', ..., xn') by optimizing the ai; these are our importances.
(That’s a way of asking the network to make a decision: which features can be dropped, and which features bring information to the table?)

There is one hidden parameter here, which is the fact that the ai sum to one: I see no particular reason to use 1 rather than another number between 1 and n-1 (given n features).
Let’s call this parameter k; in a way, it is what we think the number of important features is (my gut feeling is that sqrt(n) would be a good default).

I think that another way to deal with k would be to sum each weight over all possible values of k from 1 to n-1.

@hubert.misztela To answer your questions, I have not implemented it (I had the idea as I was typing it and I don’t have enough time in the short term to build a prototype), and this would indeed be about feature importance and not sample-specific quantities.
I believe that, as written, it would focus on features that are important for most samples rather than features that are only rarely important.
It might even be possible to add self attention to select different features for different samples. We would then have all the interpretability properties of an attention based model but for arbitrary tabular models.
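
To make the idea concrete, here is a rough, untested PyTorch sketch of the global version (the `model`, `X`, `y`, and `loss_fn` names are just placeholders for a trained network, some validation data, and the training loss; categorical features would need the embedding-mean variant mentioned above):

```python
# Untested sketch of the optimization described above. `model` is assumed to be
# a trained PyTorch module taking a (batch, n_features) float tensor, and
# `X`, `y`, `loss_fn` stand in for validation data and the training loss.
import torch

def learn_importances(model, X, y, loss_fn, k=1.0, steps=200, lr=0.1):
    n_features = X.shape[1]
    m = X.mean(dim=0, keepdim=True)              # uninformative value m_i per feature
    logits = torch.zeros(n_features, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)

    for p in model.parameters():                 # the model stays frozen,
        p.requires_grad_(False)                  # only the a_i are learned
    model.eval()

    for _ in range(steps):
        a = torch.softmax(logits, dim=0) * k     # a_i >= 0 and sum(a_i) == k
        X_damped = a * X + (1 - a) * m           # x_i' = a_i * x_i + (1 - a_i) * m_i
        loss = loss_fn(model(X_damped), y)       # keep the dampened model accurate
        opt.zero_grad()
        loss.backward()
        opt.step()

    return (torch.softmax(logits, dim=0) * k).detach()  # the learned importances a_i
```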

@Bwarner’s warnings are important, but do note that the vast majority of methods assume feature independence (which is false more often than not) and that they nevertheless work well enough.

(I work with people doing exactly that when analysing simulation codes)


Hi there,
I think the website below might be useful for your question.

I had the same issue before and found some helpful pointers there.

Good luck.

Moved this here as it’s more relevant:

On the topic of attention, this was recently published:

And the authors claim it’s competitive :slight_smile: (The model itself is in SAN.py :wink: )


I just extracted @muellerzr’s permutation feature importance snippet into a dedicated repository.

In doing so, I found a nicely documented book on the state of the art of interpretability in machine learning (which details the theory, pros, cons, and reference implementations of various methods):
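
For those who haven’t looked at the snippet, permutation importance boils down to something like this (a simplified, untested sketch rather than the actual code from the repo; `score_fn` and `df` are placeholders for your metric and validation data):

```python
# Simplified sketch of permutation feature importance (not the actual repo code).
# `score_fn(df)` is assumed to return the model's metric on a pandas DataFrame.
import numpy as np

def permutation_importance(score_fn, df, columns, seed=42):
    rng = np.random.default_rng(seed)
    baseline = score_fn(df)
    importances = {}
    for col in columns:
        shuffled = df.copy()
        # shuffling one column breaks its link to the target while keeping its distribution
        shuffled[col] = rng.permutation(shuffled[col].values)
        importances[col] = baseline - score_fn(shuffled)   # drop in metric = importance
    return importances
```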


Is there a library that supports partial dependence plots for neural networks (e.g. with PyTorch integration)?

I’d check out chapter 9 of the fastbook. It describes the general methodology for partial dependence, and then you should be able to integrate it similarly to how we have done the feature importance (IE set every value in the column to A, then sweep A from the starting value n to the ending value t). It can be done fairly straightforwardly from there. (Of course, then you have the question of whether it is "truly" representative, the same argument as for feature importance in this thread, so take it with a grain of salt.)
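
Something along these lines (untested; `predict_fn` and `valid_df` are just placeholders for however you get predictions):

```python
# Rough, untested sketch of partial dependence for any tabular model.
# `predict_fn(df)` is assumed to return predictions for a pandas DataFrame.
import numpy as np
import pandas as pd

def partial_dependence(predict_fn, df, column, values):
    averages = []
    for v in values:
        df_mod = df.copy()
        df_mod[column] = v                       # set every row's value in the column to v
        averages.append(np.mean(predict_fn(df_mod)))
    return pd.Series(averages, index=values)

# e.g. sweep 'age' over 20 evenly spaced values between its observed min and max:
# pdp = partial_dependence(predict_fn, valid_df, 'age',
#                          np.linspace(valid_df['age'].min(), valid_df['age'].max(), 20))
```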

I understand it is a general (model-agnostic) method, but it seems most libraries integrate it tightly with tree models. I do not see people using it for neural networks; there is generally too much focus on SHAP, IMO.

The work with SHAP seems very interesting, but the link gives me error 404. Has the repo been moved or renamed?

It’s now in my fastinference library :slight_smile:

Docs: muellerzr.github.io/fastinference


You can look into my version of a partial dependence implementation for fastai2: https://github.com/Pak911/fastai2-tabular-interpretation


Terence Parr just released a paper, “Nonparametric Feature Impact and Importance” (https://arxiv.org/pdf/2006.04750.pdf), that estimates feature importance and impact. The Python package is at github.com/parrt/stratx.


Hi Pak, great post!
Can’t wait to try it out when I build a tabular model!
Cheers, mrfabulous1 :smiley: :smiley:

It seems that a lot of people simply disregard the required assumption of feature independence. What are the risks? How does multicollinearity affect the feature importance scores? If I know that a few of my features are correlated, is there a way I can interpret the feature importance scores better because of that knowledge? IE will only the correlated features have incorrect scores, or will all features have incorrect scores in the presence of multicollinearity of some subset of the features?

These are general questions, but I’m using SHAP specifically. For further details or if you want to contribute to the discussion over at the github for SHAP, see my post there.


My $0.02: through folks that have domain expertise, we’ve seen that the feature importance results make sense. For instance, if I were looking at salary, I would expect job title to be a very important feature, which is what we’ve seen on the Adults dataset.

In regards to multicollinearity, you could investigate it by doing permutation importance on n x n feature pairs, but that is a giant rabbit hole. Along with this, generally what I have seen is that if features are indeed tightly correlated, I’ll notice a stark change in the relative feature importance once one is removed, and in my work they tend to be grouped together in the rankings (IE very similar in rank).
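
For what it’s worth, here is a rough, untested sketch of what I mean by permutation importance on the n x n pairs (`score_fn` stands in for your validation metric):

```python
# Sketch of the pairwise ("n x n") permutation idea mentioned above.
# `score_fn(df)` is assumed to return the model's metric on a pandas DataFrame.
import itertools
import numpy as np

def pairwise_permutation_importance(score_fn, df, columns, seed=42):
    rng = np.random.default_rng(seed)
    baseline = score_fn(df)
    results = {}
    for col_a, col_b in itertools.combinations(columns, 2):
        shuffled = df.copy()
        idx = rng.permutation(len(df))
        # shuffle the pair jointly: their mutual correlation is preserved,
        # but their joint link to the target is broken
        shuffled[col_a] = df[col_a].values[idx]
        shuffled[col_b] = df[col_b].values[idx]
        results[(col_a, col_b)] = baseline - score_fn(shuffled)
    return results
```

Comparing a pair’s joint drop with the sum of the two individual drops gives a hint of how much credit the features share.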

I’ve also opened this question up to the twitter-verse, to see what spits back. I’ll update this post with anything interesting they say :smiley:

Update 1: I was given 3 papers about the shapley library (different from SHAP apparently!), @jaxondk:


Thanks for the quick response! And for reaching out to the twitter-verse :wink:

When you say:

I’ll notice a stark change in the relative feature importance once removed

what do you mean? Are you saying that while performing permutation importance, if a feature gets removed that is highly correlated with another feature it will get a large feature importance score? Not sure what you mean by “once removed”.

Update 1: Yes, Shapley values and SHAP values are somewhat different. SHAP is a library that has some optimized approximations of Shapley values for various types of models, one of which is neural nets.
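
For reference, using DeepSHAP on a PyTorch model looks roughly like this (`model`, `background`, and `X_test` are placeholders for your own network, a small sample of training rows, and the rows to explain):

```python
# Minimal usage sketch of shap's DeepExplainer (DeepSHAP) with a PyTorch model.
import shap

explainer = shap.DeepExplainer(model, background)   # background: sample of training rows (tensor)
shap_values = explainer.shap_values(X_test)         # per-sample, per-feature attributions
```

Averaging the absolute SHAP values per feature over the dataset is the usual way to turn these local attributions into a global ranking.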

PS - what is your twitter handle?


So say workclass and living location are two variables, and (for the sake of the exercise) they’re highly correlated. It would be my expectation that, during feature importance, if I were to shuffle living location the overall accuracy % would decrease, and vice versa: once workclass was shuffled, I would (presumably) see it shift by about the same %. (This may be naivety too :slight_smile: )

Twitter handle is @TheZachMueller

OK, so your expectation (which you say may or may not be naive) is that correlated features will have similar global feature importance scores (or rankings).

Unfortunately, I don’t seem to be seeing that in my case for all of my models. I have 3 features that are >= .96 correlated (Pearson coefficient) with each other.
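
(As an aside, a quick sketch of how such groups can be flagged before reading importance scores, with `df` standing in for the feature DataFrame:)

```python
# Flag highly correlated feature pairs; `df` is a pandas DataFrame of features.
def correlated_pairs(df, threshold=0.95):
    corr = df.corr().abs()                       # Pearson correlation by default
    pairs = []
    cols = corr.columns
    for i, col_a in enumerate(cols):
        for col_b in cols[i + 1:]:
            if corr.loc[col_a, col_b] >= threshold:
                pairs.append((col_a, col_b, float(corr.loc[col_a, col_b])))
    return pairs
```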

For one of the models, they apparently aren’t used at all. On another model they are used, but with varying importances: in the global scores (using SHAP values from DeepSHAP), one of them is ranked high (2nd highest) and the other 2 are fairly low.

This is concerning to me and I’m not sure how to work around the required assumption of feature independence. Not sure if I can trust the rankings for any of them, or if I just can’t trust the ones that are actually correlated.

I’m just skimming through some of those papers. “True to the Model or True to the Data” focuses on local interpretations, and right now at least I’m more concerned with global explanations. But that paper does link to “Understanding Global Feature Contributions”, which supposedly introduces a new method called SAGE that can robustly handle correlated features and complex feature interactions.

I’ll continue to update with things I learn


Here are my notes on just a couple of these papers as well as some very useful threads/comments in various locations. Please realize that I skimmed these papers fairly quickly, and I also don’t have a PhD and am fairly removed from academia, so I don’t read massive amounts of papers on a regular basis. All that being said, I could totally be misinterpreting things and would love for someone to correct me on anything here :slight_smile: I feel fairly confident about the first one, and honestly barely looked at the second but included a small blurb about it anyway. I could be oversimplifying things in my conclusion section, so please provide some feedback so I don’t mislead the community or myself :laughing:

True to the Model or True to the Data

Discusses observational vs. interventional conditional expectations when doing feature importance, and how neither is generally preferred; rather, it’s application-specific and depends on whether you want to be “true to the model” or “true to the data”. Things like permutation importance and the SHAP approximations in DeepSHAP are interventional (it seems Lundberg, the author of shap, agrees), or “true to the model”.

The paper states that if you have independent features, importance values are the same for observational and interventional. Otherwise, “correlation splits the Beta as credit between correlated variables and higher levels of correlation leads to slower convergence of the observational Shapley value estimates.” However, I don’t know if this means the credit is split evenly between the correlated features. If it weren’t split evenly, then one correlated feature could look disproportionately more important than the other features it’s correlated with. Note that this splitting only occurs in the observational case: the interventional approach only shows you what the model is actually using, so if the model only uses one of the correlated features, it won’t split the credit among the others.

Conclusion: Essentially, “true to the model” = interventional, “true to the data” = observational. “Being true to the model is the best choice for most applications of explainable AI, where the goal is to explain the model itself.” However, if you are “focused on scientific discovery”, you will likely want to be true to the data.
The authors note that in an ideal world, you can reparameterize your model to get at the underlying independent factors. Then you can use interventional techniques without getting off the data manifold.

Understanding Global Feature Contributions

I ended up not spending too much time on this one. I was excited about it, as it seemed to be a method for handling correlated features when doing global feature importance, but then read this:

In practice we sample from the marginal distribution, which corresponds to an assumption of feature independence

I had thought this paper was about dealing with feature dependence. I stopped reading at this point, though to be honest I’m sure I’m missing the point on this one.

Scouring various threads

  • “SHAP aligns with causal interventional perturbations” (Lundberg). Honestly, that whole thread was very useful
  • In the presence of correlated features, you cannot be both true to the data and true to the model. Either you must provide some data as inputs that are off-manifold (if you use interventional), or you must allow credit to bleed between correlated features (if you use observational). This is also the main gist of the “true to model / true to data” paper. (Lundberg)
  • This comment discusses some of the differences between observational and interventional, and also provided links to some of these other threads.

The Takeaway

This was really helpful for me. I would suspect that most practitioners here are more concerned about being true to the model. If that’s the case, it seems that things like permutation importance or SHAP values will work for you, even in the presence of correlated features. You will just see what the model used, regardless of multicollinearity.

For those more interested in scientific discovery, it looks like observational Shapley values are what you want. Currently, I’m unaware of a package for calculating these for neural nets. Maybe SHAP has this functionality as an option that I’m unaware of. I would be very interested in learning about a package to compute these, as I am involved in both simply explaining a model and scientific discovery.
