Feature importance in deep learning

I’ve only used it for tree-based models, but there is the SHAP package by Scott Lundberg, which has functionality for neural networks or any other black-box model. It’s based on game-theory principles, and for tree-based models it’s what I use religiously. Native feature importance metrics in tree-based models have shortcomings, and what’s nice about SHAP is that an individual prediction can be deconstructed, as opposed to just having a global feature attribution.
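For instance, a minimal sketch of what using SHAP with a tree model looks like (the dataset and the xgboost model here are stand-ins of mine, not anything from this thread):

import shap
import xgboost
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)  # stand-in dataset
model = xgboost.XGBClassifier().fit(X, y)                  # stand-in tree model

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one attribution per feature per row

shap.summary_plot(shap_values, X)  # global picture, built from local attributions
# deconstruct a single prediction (in a notebook, call shap.initjs() first):
shap.force_plot(explainer.expected_value, shap_values[0, :], X.iloc[0, :])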


Thank you, @bernd.heidemann, this will help a lot in explaining the model. I also want something that will explain a single prediction, in the NLP context. For instance, if the model is state-of-the-art overall but has a rare example of “catastrophic failure”, I want the medical provider to know whether their patient’s prediction is a catastrophic failure before they make clinical interventions on the basis of that one prediction.

We had a case where our ICU research scientist was trying to predict probability of death from clinical notes. The best feature in the model was the fact that the family was visiting the patient. Of course the kids are only going to fly in from the other side of the country if someone has told them that their father is about to die, so that wasn’t the signal we wanted. If we could highlight the phrase “family visit”, the provider would know to discount that prediction.

Thank you @Hannibal! I googled SHAP and Scott Lundberg, and this looks like great and relevant work, though it looks hard to implement. As @axelstram mentioned, Lundberg has a really great README on his GitHub here:

In this README, he references “Deep SHAP”, where he applies his method to deep learning and implements it in Keras/TensorFlow. He also mentions that there is some preliminary support for PyTorch, which I found deeper in the same GitHub tree here:
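From skimming the README, the PyTorch usage looks roughly like this (the tiny model and the random tensors are placeholders of mine, just to show the calls):

import shap
import torch
import torch.nn as nn

# placeholder model and data, only to illustrate the API
model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 2))
background = torch.randn(100, 20)  # sample of training inputs; sets the baseline expectation
test_batch = torch.randn(10, 20)

explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(test_batch)  # list with one attribution array per output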

I don’t understand how it all works, other than that it seems to go quite a bit deeper than a simple “attention” model. For a look at how deep the rabbit hole goes, there is a very nice video from 2017 by the group that created a related method, “DeepLIFT”. Like Jeremy’s lessons, this video is very clear and very deep!

@danaludwig thanks for the video! Very interesting approach!

As far as I could see, it should be possible to use SHAP with a custom PyTorch model. But we would lose a little bit of convenience, for example the automatically generated embeddings for tabular data…


This is exactly what I was looking for, as I was having difficulties with the previous one due to my loss function. Thank you!!!

Just going to tag on this a little bit: I refactored the code so that now you just need to pass in the learner.

import numpy as np
import pandas as pd
import torch

def feature_importance(learner):
    # based on: https://medium.com/@mp.music93/neural-networks-feature-importance-with-fastai-5c393cf65815
    data = learner.data.train_ds.x
    cat_names = data.cat_names
    cont_names = data.cont_names
    # baseline loss on the untouched validation set
    loss0 = np.array([learner.loss_func(learner.pred_batch(batch=(x, y.to("cpu"))), y.to("cpu"))
                      for x, y in iter(learner.data.valid_dl)]).mean()
    fi = dict()
    types = [cat_names, cont_names]
    for j, t in enumerate(types):
        for i, c in enumerate(t):
            loss = []
            for x, y in iter(learner.data.valid_dl):
                col = x[j][:, i]  # x[0] holds the categorical variables, x[1] the continuous ones
                idx = torch.randperm(col.nelement())  # random permutation of this column's values
                x[j][:, i] = col.view(-1)[idx].view(col.size())
                y = y.to('cpu')
                loss.append(learner.loss_func(learner.pred_batch(batch=(x, y)), y))
            fi[c] = np.array(loss).mean() - loss0  # importance = increase in loss after shuffling
    d = sorted(fi.items(), key=lambda kv: kv[1], reverse=True)
    return pd.DataFrame({'cols': [l for l, v in d], 'imp': np.log1p([v for l, v in d])})

@bernd.heidemann I’m just now taking a Udemy course on feature selection in hopes of getting a bit better at this. How would I go about handling the case where two variables are highly correlated with each other? Scramble both columns at the same time instead of one (a double permutation)?

Thanks!!!

@muellerzr I have no idea if double permutation will yield good results, but I would try it on a well-known dataset. Maybe combined partial dependence plots would also be worth a try.
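If you do try it, one way to scramble the pair is to apply the same permutation to both columns, so their correlation survives but their joint link to the target is broken. A small untested sketch (the column positions i1 and i2 are assumptions of mine):

import torch

def permute_together(x_cont, i1, i2):
    # one shared permutation for both correlated columns: they stay
    # consistent with each other while losing their relation to the target
    idx = torch.randperm(x_cont.size(0))
    x_cont[:, i1] = x_cont[idx, i1]
    x_cont[:, i2] = x_cont[idx, i2]
    return x_cont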

@bernd.heidemann Does your partial dependence code above still work? When I try it on the ADULT dataset, the diff is always zero regardless of the variable or its type.

@muellerzr Wanted to say thank you for this. It’s really fantastic.

@johnkeefe absolutely! I also updated the code a bit more, as I wasn’t quite satisfied. This one now has a progress bar so you know which variable (out of the total) you are on, along with another column for each variable’s type (something I found quite confusing to keep looking up).

import numpy as np
import pandas as pd
import torch
from tqdm import tqdm
from fastai.basic_train import Learner

def feature_importance(learn:Learner):
    # based on: https://medium.com/@mp.music93/neural-networks-feature-importance-with-fastai-5c393cf65815
    data = learn.data.train_ds.x
    cat_names = data.cat_names
    cont_names = data.cont_names
    # baseline loss on the untouched validation set
    loss0 = np.array([learn.loss_func(learn.pred_batch(batch=(x, y.to("cpu"))), y.to("cpu"))
                      for x, y in iter(learn.data.valid_dl)]).mean()
    fi = dict()
    types = [cat_names, cont_names]
    with tqdm(total=len(data.col_names)) as pbar:
        for j, t in enumerate(types):  # first all of cat_names, then cont_names
            for i, c in enumerate(t):
                loss = []
                for x, y in iter(learn.data.valid_dl):  # for every batch in the validation set
                    col = x[j][:, i]  # select one column of tensors
                    idx = torch.randperm(col.nelement())  # generate a random permutation
                    x[j][:, i] = col.view(-1)[idx].view(col.size())  # replace the column with its shuffled version
                    y = y.to('cpu')
                    loss.append(learn.loss_func(learn.pred_batch(batch=(x, y)), y))
                pbar.update(1)
                fi[c] = np.array(loss).mean() - loss0  # importance = increase in loss after shuffling
    d = sorted(fi.items(), key=lambda kv: kv[1], reverse=True)
    df = pd.DataFrame({'Variable': [l for l, v in d], 'Importance': np.log1p([v for l, v in d])})
    # label each variable's type without row-by-row chained assignment
    df['Type'] = np.where(df['Variable'].isin(cat_names), 'categorical', 'continuous')
    return df


One other important thing that I only realized through learning (we all learn together, right?): I was looking through the documentation on permutation importance, see here: https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html

"To avoid re-training the estimator we can remove a feature only from the test part of the dataset, and compute score without using this feature. "

And I realized we actually need to do this on the test set. My other posts have links to what a labeled (gradable) test set looks like; if anyone can’t find them, I’ll post one. I noticed a very key difference in my feature selection: now the values actually make justifiable sense to me, and the importances showed much more reasonable losses. The solution I am using now is to pass in a labeled LabelList as the test set, and in the declaration of loss0 I change iter(data.valid_dl) to iter(test.train_dl).
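Concretely, the substitution looks like this (assuming test is a DataBunch built from that labeled test set, so test.train_dl yields (x, y) batches; the same swap applies inside the permutation loop):

loss0 = np.array([learn.loss_func(learn.pred_batch(batch=(x, y.to("cpu"))), y.to("cpu"))
                  for x, y in iter(test.train_dl)]).mean()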

Overall the feature rankings were not different; however, it is better practice this way and more generally accepted.

However, I think I may move to implementing something like RFE as a comparison, in case there are relationships between certain variables:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn-feature-selection-rfe
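For reference, a minimal sketch of RFE with scikit-learn (a random forest stands in for the neural net here, since RFE needs an estimator exposing coef_ or feature_importances_):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset
# keep the 10 strongest features, dropping the weakest one per iteration
rfe = RFE(RandomForestClassifier(n_estimators=100), n_features_to_select=10, step=1)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the selected features
print(rfe.ranking_)  # 1 = selected; larger numbers were eliminated earlier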

I’ve also worked on these problems (feature importance, partial dependence, etc.): here is the post and here is the notebook for the Rossmann data.
As I found out on my own data, simple column permutation can be misleading.
Here is a quote from my notebook:

Wonderful, now we know it (not really) we can move on to Partial Dependence (no)

The first thing that hinted to me that it is not OK to do this with a NN with embeddings was the crazy difference in importance between features (in my other case it was even bigger).
I was looking at the data and could not believe that features which had to be pretty important for the case were at the bottom of the list (it was a case from a field I knew a lot about).
Then I noticed that pretty much all the important features were categorical columns, and vice versa. And when I (using an editable install) capped the embedding size at 10 (instead of up to 600), the gap became much smaller.
So it became pretty clear to me why embeddings (categorical columns) seem more valuable: each continuous variable is represented by a single float, while each categorical one is represented by a vector of several dozen. When we randomize a categorical column, we mess with tens of columns rather than one, which is obviously more harmful to accuracy.
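To make the asymmetry concrete, a tiny illustration (the sizes here are invented for the example):

import torch
import torch.nn as nn

emb = nn.Embedding(1000, 50)        # one categorical column: 1000 levels -> 50 floats each
cat = torch.randint(0, 1000, (8,))  # a batch of 8 category codes
cont = torch.randn(8, 1)            # one continuous column: 8 values -> 1 float each

# shuffling `cat` perturbs all 50 embedding dimensions at once,
# while shuffling `cont` perturbs a single input column
print(emb(cat).shape)  # torch.Size([8, 50])
print(cont.shape)      # torch.Size([8, 1])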

What do we do? We will use the next (much more computationally expensive) option.
I sadly present to you the process that involves retraining the NN for each column (or group of columns).

The idea is very simple. We just throw away the column completely, retrain the NN, and compare the errors.
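Roughly like this sketch (untested; df, cat_names, cont_names, and dep_var are assumed to already exist, and the fastai v1 settings are kept identical across runs so the losses stay comparable):

from fastai.tabular import *

def retrain_loss(df, cats, conts, dep_var):
    # build a fresh databunch and train a fresh model with fixed settings
    data = (TabularList.from_df(df, cat_names=cats, cont_names=conts,
                                procs=[FillMissing, Categorify, Normalize])
            .split_by_rand_pct(0.2, seed=42)
            .label_from_df(cols=dep_var)
            .databunch())
    learn = tabular_learner(data, layers=[200, 100])
    learn.fit_one_cycle(3)
    return float(learn.validate()[0])  # final validation loss

baseline = retrain_loss(df, cat_names, cont_names, dep_var)
importance = {}
for col in cat_names + cont_names:
    new_cats = [c for c in cat_names if c != col]
    new_conts = [c for c in cont_names if c != col]
    # throw the column away completely, retrain, and compare the errors
    importance[col] = retrain_loss(df.drop(columns=[col]), new_cats, new_conts, dep_var) - baseline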

Maybe it can help


Thank you for the insight @Pak! Fascinating! I hadn’t noticed this, as my research data is entirely categorical. Perhaps that makes a difference, since the results I’m getting do make sense in the context of what I am studying. Perhaps it’s a special case where, if you only have one of the two variable types, it won’t matter as much. Thoughts?

I also have some thoughts on your findings, let me try a few things.

Yes, I think you are right. We are dealing here with the relative importance of different kinds of features, so if you use categorical only, it should work OK.
But after that finding :slight_smile: I’ve pretty much lost faith in permutation, and now I believe more in retraining (and in multiple retrainings for certainty; it turned out not to be as computationally expensive as it seemed).

I have a few ideas I want to compare. I’ll use the ADULT dataset just to be sure, as I’m not wholly convinced by the logic yet (just me being feisty) :wink: but I will let you know.

Yes, that would be great. My thoughts are based on only two datasets.
One concern: I don’t know whether the ADULT dataset is wide enough (in terms of number of features) to expose the difference.
But it would be wonderful to know how it works out.

Do you have a dataset you’d recommend that is wide enough? Because ideally, with the permutation importance done above, the only thing being replaced is the tensor mapping generated through the embedding matrix, and that is all. That is why I’m confused by the logic and want to see it for myself.

I really don’t know; maybe ADULT is good enough. I used the Rossmann data, but now I see that ADULT is much more interpretable by a human for a sanity check, since Rossmann’s features are a mystery to a non-expert (I’m also a non-expert in sales :slight_smile: ).