Feature importance in deep learning

Sometimes it may skip one or two, but in my tests I've noticed it operates the same with tabular data (no skipping found)… I haven't quite finished working on it yet, as I have a lot on my plate right now. I will get to it very soon though!

Maybe, when I find some time, I will try to compare your method of substituting the validation set in learn.data with the method of manually applying .predict() to each row, to see if it produces the same result…

Here's a notebook where I explored just that :slight_smile: (If I missed something in there, let me know. Or if you see any mistakes. It's currently 5am and I haven't slept, so a weary once-over may have missed something.)

@Pak I'm wondering if it has to do with the databunch generation itself (I haven't looked into this yet, it's just an idea). For instance, if we make a non-split databunch, is the order the same as in our original dataframe?

Edit: Confirmed it does.

Edit 2: I see the issue, or a hint of an issue. Say I call learn.get_preds(), which will run on the validation dataset. We get an array of predictions, with the second item in that array being the c2i indexes. I am not seeing the same prediction being generated at all, despite the confidence being well above 80% in most cases. Part of that could be variation in the model itself, but when I run learn.get_preds() multiple times I notice a large number of changes in those predictions. Meanwhile learn.predict() always gives me the same output.

Great point. I should definitely test my approach of applying a test set to the model against what you just said, to check whether it outputs consistent results (as I remember, I was checking it with learn.predict()).
Upd 1: I've tested it and it doesn't; there must be some errors in my approach :frowning: I will dive into it further.
Upd 2: Something weird has just happened. I tested the difference in results between .get_preds() and .predict() (there was one). And suddenly I was getting the same result out of nowhere. I can confirm that it's not a change in code, because I did not edit it, I only appended new code to the notebook. Then I even reran my first experiments and they worked too. I have no idea what is going on. It looks like the library updated itself (the progress bar also started working); I did not do that manually. Maybe your case will magically start working too :slight_smile:
Upd 3: I've figured out why my approach stopped working. Fastai changed how it deals with the last layers, so I had to update my code too. Now I get the same results with all 3 functions: .get_preds(), .predict() and my own get_cust_preds().
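
For reference, here's a rough sketch of the kind of consistency check discussed above. It assumes a fastai v1 tabular learner and that valid_df (a hypothetical name) is the dataframe the validation set was built from:

import torch
from fastai.basic_data import DatasetType

# Batched probabilities over the whole validation set
batched, _ = learn.get_preds(ds_type=DatasetType.Valid)
# Per-row probabilities via learn.predict(); element [2] of the returned tuple is the raw output
rowwise = torch.stack([learn.predict(valid_df.iloc[i])[2] for i in range(len(valid_df))])
# If both code paths are consistent, this should print True
print(torch.allclose(batched, rowwise, atol=1e-4))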

@Pak see the discussion here:

Turns out I was missing a step!

Hi again.
I have managed to run some experiments with my Rossmann notebook (and updated it as well). And I've noticed that you were probably right: the relative feature importance values (column permutation vs. retrain methods) between different features in my notebook are really comparable. I was confused by the absolute values, but if I normalize them, the numbers tell a different story (which, by the way, I noticed only after I plotted the FI :frowning: )
My thoughts on this for now are the following:
Which to choose is a tricky question. On the one hand, the naive method (sorry, I will keep calling it that; I no longer think it is naive, but that's how it is called in my notebook, so for historical reasons :slight_smile: it is really column permutation) is way faster. On the other hand, it depends on what you mean by the word importance.
If every feature were a separate entity, not related to the others, I would expect the results to be much more similar (and I would definitely recommend the naive method), but in real life they hardly ever are. In real life we have a mess of interconnected (as well as self-made, derivative) features. And what do we really want to know: how our current model ranks features against one another relative to the dependent variable, or how much unique information a given feature holds? I say it depends. I see cases where the first option is better and some where the second is (at least the one where we try to eliminate redundant features).
So I think these two methods just answer two slightly different questions on the importance topic. I think, given enough time, I would probably use both of them to get some insights.
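
For concreteness, here is a minimal, model-agnostic sketch of the column permutation method discussed above; score_fn is a hypothetical helper that returns the trained model's validation metric on a (possibly permuted) copy of the validation dataframe:

import numpy as np

def permutation_importance(valid_df, score_fn, n_rounds=3, seed=42):
    rng = np.random.default_rng(seed)
    baseline = score_fn(valid_df)
    importances = {}
    for col in valid_df.columns:
        drops = []
        for _ in range(n_rounds):
            shuffled = valid_df.copy()
            # Shuffle one column to break its link to the target, keeping everything else fixed
            shuffled[col] = rng.permutation(shuffled[col].values)
            drops.append(baseline - score_fn(shuffled))
        importances[col] = float(np.mean(drops))
    return importances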

@Pak thanks for being so thorough with this! I agree, both are doable and it just depends on the budget (money and time) that you can accommodate for the methods: the column permutation method is designed to look at what the model is looking at the most, whereas full retraining gets at what the model can find the most useful. Both are cut from the same cloth to some degree. But I agree that both could and should be done. The column permutation can help explain the model's behavior quickly as well!

I'll also drop this here: Terence Parr just released a new paper discussing a Stratification Approach to Partial Dependence for Codependent Variables.

Source code is here:

FYI, here's how I got at least the model-agnostic KernelExplainer from shap to run in a notebook without errors on a tabular learner (model/data on GPU with both categorical and continuous variables):

# learn = tabular_learner(...)
# learn.fit_one_cycle(...)

import numpy as np
import pandas as pd
import torch
import shap
shap.initjs()

def pred(data):
    # shap passes a plain numpy array; split it back into categorical and continuous parts
    device = learn.data.device
    cat_cols = len(learn.data.train_ds.x.cat_names)
    cont_cols = len(learn.data.train_ds.x.cont_names)
    x_cat = torch.from_numpy(data[:, :cat_cols]).to(device, torch.int64)
    x_cont = torch.from_numpy(data[:, -cont_cols:]).to(device, torch.float32)
    pred_proba = learn.model(x_cat, x_cont).detach().to('cpu').numpy()
    return pred_proba

def shap_data(data):
    # Grab one (non-denormalized) batch each for the shap background set and the rows to explain
    X_train, y_train = data.one_batch(denorm=False, cpu=False)
    X_test, y_test = data.one_batch(denorm=False, cpu=False)
    cols = data.train_ds.x.col_names
    X_train = pd.DataFrame(np.concatenate([v.to('cpu').numpy() for v in X_train], axis=1), columns=cols)
    X_test = pd.DataFrame(np.concatenate([v.to('cpu').numpy() for v in X_test], axis=1), columns=cols)
    return X_train, X_test

X_train, X_test = shap_data(learn.data)
e = shap.KernelExplainer(pred, X_train)
shap_values = e.shap_values(X_test, nsamples=100, l1_reg=False)
shap.force_plot(e.expected_value[0], shap_values[0], X_test)

shap.summary_plot(shap_values, X_test, plot_type="bar")

This grabs two batches from the training set as X_train and X_test for shap.

Sadly it looks like Google Colab doesn't support the JavaScript library :frowning:

Have you seen this: https://github.com/slundberg/shap/issues/279#issuecomment-427240107? The JS should work; it just needs to be initialised in every cell that produces visual output.
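
In practice that just means repeating the init call in each plotting cell (a tiny sketch reusing the names from the snippet earlier in the thread):

import shap

shap.initjs()  # call this in every cell that renders a JS-based plot
shap.force_plot(e.expected_value[0], shap_values[0], X_test)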

Hi all. Has ANYONE gotten SHAP DeepExplainer to work with the FastAI Tabular DataBlock? It seems the formats expected by SHAP are PyTorch primitives and are, of course, different from the FastAI wrappers. SHAP seems to be a wonderful approach for some interpretability of the NN. Or is there any upcoming extension to FastAI to provide this sort of functionality? It seems fairly easy to do with PyTorch and Keras as well, but if anyone has this working with FastAI please let me know. Thanks!

I’ve ported shap to fastai2: https://github.com/muellerzr/fastinference

Was the subject of SHAP's assumptions about feature independence discussed?
In tabular data it is quite an issue.

Have any of you read the papers behind GradientExplainer and/or DeepExplainer and come to a clear conclusion that we can use them without the independence assumption?

In Some Baselines for other Tabular Datasets with fastai2 the discussion drifted to model interpretation, so I am bringing the conversation here to make it easier to find, and maybe more people will express their thoughts on the subject.

This is a response to the question How to conduct feature importance without assumption of feature independence?

We can extract FI without assuming feature independence using the attention from TabNet, SHAP explainers which do not use interventions (which might include GradientExplainer; sorry, I haven't checked that yet), @nestorDemeure's idea, or in general methods that learn the feature importance during training. In the future maybe the next update of SHAP will be resistant to the problem, because in their latest paper (1911.11888) they describe improvements, but it's not clear to me.

Any other thoughts?

@hubert.misztela I guess the one question I still have is:

Do we still have this independence issue with our permutation importance (with the raw values, not SHAP)? I'd assume we would, but I just want to be sure. Because if so, then our FI tells us absolutely nothing, no?

If features are correlated, then permutation importance can give biased results. In Interpretable Machine Learning, Christoph Molnar discusses this in the feature importance chapter, particularly in the disadvantages section.

Unless otherwise stated, I'd expect the assumption of feature independence to be a requirement in any method that involves holding some features constant while modifying other features. TreeSHAP doesn't always have this assumption, although it looks like certain output types do require setting feature_dependence="independent".
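
As a hedged example of that setting, based only on the parameter named above (depending on your shap version the option may be spelled feature_dependence or feature_perturbation; tree_model and background_df are hypothetical names for a fitted tree model and a sample of training rows):

import shap

explainer = shap.TreeExplainer(tree_model, data=background_df,
                               feature_dependence="independent")
shap_values = explainer.shap_values(X_test)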

In regression analysis, feature independence (or in statistics terms: a lack of multicollinearity between independent variables, predictors, or covariates) is usually a required assumption as we are interested in interpreting the coefficients of the covariates. There are multiple methods for detecting multicollinearity, which we could use to check on our data.

Depending on the circumstances, multicollinearity isn't always a problem. For example, through feature engineering or domain knowledge we might have a model whose inputs include age and age_squared, which by definition will be correlated with each other. In regression analysis we would always interpret the two coefficients together and never independently. For our tabular neural networks, we'd want to do the same, so perhaps we'd modify permutation importance to always permute age and age_squared together (a small sketch follows below). Likewise if we have an interaction term or other combinations of features.
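
Here's what that joint permutation could look like, reusing the permutation-importance idea from earlier in the thread (valid_df and the column names are placeholders): the grouped columns are shuffled with the same index, so their internal relationship is preserved while their link to the target is broken.

import numpy as np

def permute_group(valid_df, group, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(valid_df))
    shuffled = valid_df.copy()
    # Apply one shared permutation to the whole group, e.g. group = ["age", "age_squared"]
    shuffled[group] = valid_df[group].values[idx]
    return shuffled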

Small amounts of multicollinearity between features we want to be independent might not be completely problematic [1]. The real world is messy, and practitioners don't always have ideal data. Unfortunately, there are no hard and fast rules on what counts as acceptable multicollinearity, only various rules of thumb. For example, if we are modeling children's health, then age, weight, and height are probably going to be correlated with each other. But if the correlation isn't too large, we can still look at their feature importance, assuming we are careful in our reporting and interpretation, and if we recognize and acknowledge that our results might be biased. Or, depending on the method used, we could treat them as control variables and limit our analysis to other features.

From a statistical practitioner’s perspective, if you want to interpret the feature importance of tabular neural networks I’d recommend this non-exhaustive list:

  1. Start by plotting a pairwise plot and correlation matrix of all the data. This is more of an eyeball test for collinearity, as it can only reveal pairwise correlation, not multicollinearity.
  2. Normalize the data. Data normalization can remove certain types of collinearity. Keep in mind that domain knowledge might suggest something other than straight normalization. For example, when working with economics data the natural log of income is often more useful for interpretation than normalized income.
  3. Run at least one multicollinearity test, preferably multiple. A non-exhaustive list of options includes variance inflation factor (VIF), the Farrar–Glauber test, perturbing the data, and the condition number test. Of these, only VIF appears to have a Python implementation in statsmodels; the rest have R packages. Be careful with VIF in statsmodels, as it appears that by default it doesn't include a constant term, so you'll need to add another column to your data filled with ones (see the sketch after this list).
  4. Remember that some forms of multicollinearity are not deal breakers if handled correctly. This depends on what type of collinearity and what type of feature importance analysis are being applied.
  5. Even if all the statistical tests look good, there could still be undetected multicollinearity. So always be careful when presenting results.
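
Here's a minimal sketch of the VIF check from point 3, assuming df is a pandas DataFrame of the features (a hypothetical name); add_constant supplies the intercept column mentioned above:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = add_constant(df)  # add the constant column so the VIFs are meaningful
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif.drop("const"))  # common rule of thumb: VIF above 5-10 suggests problematic collinearity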

Keep in mind that even with completely independent features, there are other factors that could bias feature interpretation. Some examples include omitted-variable bias and dealing with repeated measurements from a longitudinal study (measuring patients over time) or measurements made on clusters of related items (studying students in schools).

Any feature importance package, or addon to fast.ai, should clearly mention the assumption of feature independence if required.

Let me know if you have any questions or corrections.


  1. Somewhere there is probably a statistics theorist disagreeing with this statement :blush: ↩︎

Thank you both, this has helped me understand why TabNet is such a big thing (and has given me a better understanding of the bigger issues with an FCNN for tabular interpretability). I appreciate the thorough thought put into both; it's given me much to think about :slight_smile:

Here is an idea I suggested in the other thread:

Given a trained model f(x0,...,xn) , we might be able to learn the feature importance by simulating the attention.

For each feature xi, we can associate an uninformative value mi (the mean, or the mean of the embedding). We can then create a dampened feature xi' = ai*xi + (1-ai)*mi with the sum of all ai equal to 1 (this can be enforced by a softmax).

We now maximize the accuracy of f(x0',...,xn') by optimizing the ai; these are our importances.
(That's a way to ask the network to make a decision: which features can be dropped and which features bring information to the plate?)

There is one hidden parameter here, which is the fact that the ai sum to one: I see no particular reason to use 1 and not another number between 1 and n-1 (given n features).
Let's call this parameter k; in a way it is what we think the number of important features is (my gut feeling is that sqrt(n) would be a good default).

I think that another way to deal with k would be to sum each weight over all possible values of k from 1 to n-1.
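
Here's a rough PyTorch sketch of how that could look, purely as an illustration under assumptions: it supposes a frozen trained model that takes a single float tensor of shape (batch, n_features), and the names FeatureDampener and uninformative_values are made up for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDampener(nn.Module):
    def __init__(self, model, uninformative_values, k=1.0):
        super().__init__()
        self.model = model.eval()                 # frozen trained model f
        for p in self.model.parameters():
            p.requires_grad_(False)
        self.register_buffer("m", uninformative_values)  # per-feature uninformative values m_i
        self.logits = nn.Parameter(torch.zeros(uninformative_values.numel()))
        self.k = k                                # budget: the a_i sum to k

    def importances(self):
        return self.k * F.softmax(self.logits, dim=0)

    def forward(self, x):
        a = self.importances()                    # a_i >= 0, sum of a_i = k
        x_damp = a * x + (1 - a) * self.m         # dampened features x_i'
        return self.model(x_damp)

# Train only the logits against the usual task loss on held-out data;
# afterwards, importances() gives the learned per-feature weights a_i.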

@hubert.misztela To answer your questions: I have not implemented it (I had the idea as I was typing it, and I don't have enough time in the short term to build a prototype), and this would indeed be about feature importance and not sample-specific quantities.
I believe that, as written, it would focus on features important for most samples and not on features that are rarely important.
It might even be possible to add self-attention to select different features for different samples. We would then have all the interpretability properties of an attention-based model, but for arbitrary tabular models.

@Bwarner's warnings are important, but do note that the vast majority of methods assume feature independence (which is false more often than not) and that they work well enough nevertheless.

(I work with people doing exactly that when analysing simulation codes)
