Feature importance in deep learning

Perhaps it was because the procs didn't match up. If you need them to match, that is what processor = data.processor is for.

I’ll rerun the above on Rossmann later today


@Pak I wasn't able to get to it yesterday, but I ran it today, and my results differ from yours a bit. Here is my new function, which takes a test dataframe, shuffles one column at a time, and validates on the result:

import copy

def feature_importance(learn:Learner, cats:list, conts:list, dep_var:str, test:DataFrame):
  data = learn.data.train_ds.x
  procs = data.procs
  cat, cont = copy.deepcopy(cats), copy.deepcopy(conts)
  if 'CrossEntropyLoss' in str(learn.loss_func):
    dt = (TabularList.from_df(test, path='', cat_names=cat, cont_names=cont, 
                              procs=procs)
                             .split_none()
                             .label_from_df(cols=dep_var)
                             .databunch(bs=learn.data.batch_size))
  else:
    dt = (TabularList.from_df(test, path='', cat_names=cat, cont_names=cont, 
                              procs=procs)
                             .split_none()
                             .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
                             .databunch(bs=learn.data.batch_size))
    
  learn.data.valid_dl = dt.train_dl
  # validate() returns [loss, metric, ...]; index 1 is the first metric, not the loss
  loss0 = float(learn.validate()[1])
  
  fi=dict()
  cat, cont = copy.deepcopy(cats), copy.deepcopy(conts)
  types = [cat, cont]
  for j, t in enumerate(types):
    for i, c in enumerate(t):
      print(c)
      base = test.copy()
      # True shuffle (no replacement); .values avoids index realignment if test has a non-default index
      base[c] = base[c].sample(frac=1).values
      cat, cont = copy.deepcopy(cats), copy.deepcopy(conts)
      if 'CrossEntropyLoss' in str(learn.loss_func):
        dt = (TabularList.from_df(base, path='', cat_names=cat, cont_names=cont, 
                              procs=procs)
                             .split_none()
                             .label_from_df(cols=dep_var)
                             .databunch(bs=learn.data.batch_size))
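      else:
        # Use the shuffled copy here too (the original passed `test`, so this branch never saw the shuffle)
        dt = (TabularList.from_df(base, path='', cat_names=cat, cont_names=cont, 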
                              procs=procs)
                             .split_none()
                             .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
                             .databunch(bs=learn.data.batch_size))
      learn.data.valid_dl = dt.train_dl
      fi[c] = float(learn.validate()[1]) - loss0
      
  d = sorted(fi.items(), key =lambda kv: kv[1], reverse=True)
  df = pd.DataFrame({'Variable': [l for l, v in d], 'Accuracy': [v for l, v in d]})
  # Label each variable by type without chained assignment
  df['Type'] = ['categorical' if v in cats else 'continuous' for v in df['Variable']]
  return df                  

This gives a fairly standard approach for the two default loss functions Fast.AI will use. My results were different from yours though. A negative value means shuffling that column hurt the model, so the most negative features are the most important.

╔════════╦══════════════════════════╦═══════════╦═════════════╗
║ Number ║         Variable         ║ Accuracy  ║ Type        ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    0   ║       SchoolHoliday      ║ 0.001581  ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    1   ║           trend          ║ 0.001569  ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    2   ║     AfterStateHoliday    ║ 0.001444  ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    3   ║           Month          ║ 0.001159  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    4   ║      StateHoliday_bw     ║ 0.001103  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    5   ║         trend_DE         ║ 0.001090  ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    6   ║       Min_Humidity       ║ 0.001085  ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    7   ║    Max_Wind_SpeedKm_h    ║ 0.000958  ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    8   ║     Max_TemperatureC     ║ 0.000871  ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 9      ║ StateHoliday             ║ 0.000795  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 10     ║ Min_TemperatureC         ║ 0.000791  ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 11     ║ Events                   ║ 0.000748  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 12     ║ PromoInterval            ║ 0.000531  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 13     ║ Promo2Weeks              ║ 0.000477  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 14     ║ StoreType                ║ 0.000465  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 15     ║ Promo2SinceYear          ║ 0.000420  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 16     ║ Store                    ║ 0.000397  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 17     ║ Year                     ║ 0.000392  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 18     ║ CompetitionMonthsOpen    ║ 0.000334  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 19     ║ BeforeStateHoliday       ║ 0.000255  ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 20     ║ State                    ║ 0.000107  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 21     ║ Assortment               ║ -0.000095 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 22     ║ Day                      ║ -0.000122 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 23     ║ Promo_bw                 ║ -0.000333 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 24     ║ CloudCover               ║ -0.000406 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 25     ║ Mean_TemperatureC        ║ -0.000516 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 26     ║ Promo                    ║ -0.001300 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 27     ║ SchoolHoliday_bw         ║ -0.001309 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 28     ║ Mean_Humidity            ║ -0.001415 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 29     ║ SchoolHoliday_fw         ║ -0.001569 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 30     ║ StateHoliday_fw          ║ -0.001817 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 31     ║ Week                     ║ -0.004419 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 32     ║ DayOfWeek                ║ -0.008283 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 33     ║ Max_Humidity             ║ -0.008312 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 34     ║ CompetitionDistance      ║ -0.008432 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 35     ║ CompetitionOpenSinceYear ║ -0.008464 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 36     ║ Mean_Wind_SpeedKm_h      ║ -0.008909 ║ continuous  ║
╚════════╩══════════════════════════╩═══════════╩═════════════╝

Store wound up being somewhere in the middle here, so perhaps I am doing something wrong?

Here are the results given the old function from earlier posts:

╔════════╦══════════════════════════╦═══════════╦═════════════╗
║ Number ║         Variable         ║ Accuracy  ║ Type        ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    0   ║    Mean_Wind_SpeedKm_h   ║ 0.000946  ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    1   ║         Promo_bw         ║ 0.000924  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    2   ║           Promo          ║ 0.000844  ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    3   ║           Store          ║ 0.000747  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    4   ║     SchoolHoliday_fw     ║ 0.000728  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    5   ║      Promo2SinceYear     ║ 0.000717  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    6   ║        Assortment        ║ 0.000653  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    7   ║         Promo_fw         ║ 0.000611  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║    8   ║      StateHoliday_fw     ║ 0.000428  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 9      ║ Day                      ║ 0.000400  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 10     ║ Max_Wind_SpeedKm_h       ║ 0.000358  ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 11     ║ CompetitionDistance_na   ║ 0.000294  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 12     ║ Month                    ║ 0.000185  ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 13     ║ trend_DE                 ║ 0.000050  ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 14     ║ BeforeStateHoliday       ║ 0.000014  ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 15     ║ SchoolHoliday            ║ -0.000037 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 16     ║ CompetitionMonthsOpen    ║ -0.000058 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 17     ║ StateHoliday             ║ -0.000058 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 18     ║ Max_Humidity             ║ -0.000077 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 19     ║ Mean_TemperatureC        ║ -0.000136 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 20     ║ StateHoliday_bw          ║ -0.000148 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 21     ║ StoreType                ║ -0.000163 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 22     ║ Mean_Humidity            ║ -0.000193 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 23     ║ SchoolHoliday_bw         ║ -0.000246 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 24     ║ DayOfWeek                ║ -0.000286 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 25     ║ trend                    ║ -0.000390 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 26     ║ Promo2Weeks              ║ -0.000517 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 27     ║ Min_Humidity             ║ -0.000906 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 28     ║ PromoInterval            ║ -0.000937 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 29     ║ Min_TemperatureC         ║ -0.001001 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 30     ║ CompetitionOpenSinceYear ║ -0.001064 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 31     ║ AfterStateHoliday        ║ -0.001515 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 32     ║ CloudCover               ║ -0.001570 ║ continuous  ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 33     ║ State                    ║ -0.002007 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 34     ║ Events                   ║ -0.002613 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 35     ║ Year                     ║ -0.003186 ║ categorical ║
╠════════╬══════════════════════════╬═══════════╬═════════════╣
║ 36     ║ CompetitionDistance      ║ -0.005161 ║ continuous  ║
╚════════╩══════════════════════════╩═══════════╩═════════════╝

Both of these tables compute importance as shuffled_accuracy - baseline_accuracy.
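To make the shuffled-minus-baseline definition concrete, here is a minimal sketch outside fastai. It assumes a plain NumPy feature matrix and a hypothetical `toy_predict` function standing in for a trained model; none of these names come from the posts above.

```python
import numpy as np

def permutation_importance(predict, X, y, metric, seed=0):
    """Return {column index: shuffled_metric - baseline_metric}."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    deltas = {}
    for j in range(X.shape[1]):
        X_shuffled = X.copy()
        # Permute one column; every other column keeps its original rows.
        X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
        deltas[j] = metric(y, predict(X_shuffled)) - baseline
    return deltas

# Hypothetical stand-in for a trained model: only column 0 matters.
def toy_predict(X):
    return (X[:, 0] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 0).astype(int)

fi = permutation_importance(toy_predict, X, y, accuracy)
# Most negative first: those are the columns the model relies on.
for col, delta in sorted(fi.items(), key=lambda kv: kv[1]):
    print(col, round(delta, 4))
```

With an accuracy-style metric, the column the model depends on gets a clearly negative value, while columns the model ignores get exactly zero. With an error-style metric the sign flips, so it is worth checking which kind of number validate() is returning before reading the table.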

Let me know your thoughts.


Hm… interesting results…
The problem with this dataset is us :) We are not domain experts, so it is hard to interpret. For example, Mean_Wind_SpeedKm_h appears to affect sales a lot, and my first reaction is that it should not be at the top. My intuition is that Store should be at the top (or in the top 3), but that is not a very educated guess.

Unfortunately I don't have enough time to dig into this right now (maybe I can find a slot on a weekend).
But I can suggest one process for checking the result.

Take 4 distinct features from different parts of your feature importance table (one from the top 3, one from the bottom 3, and 2 from the middle, as far from each other as possible).
Then train your whole model (by hand, no automation) and record the accuracy. Next, remove those features one at a time (only one per step) from the list of cont/cat features, retrain, and record the accuracy again (manually recreating the databunch and learner each time, just to check that everything is OK at each step). Comparing the accuracies should give you the 'real' relative feature importance of these four features. The order should match the order in your shuffling-based FI. If it is completely different, then there is a problem with the shuffling or its implementation.
Oh, and I compared accuracies on validation sets. I split the initial dataframe in two (train and validation) to be sure that I always compare on the same data (a fixed valid set). I do remember adding a separate parameter for checking accuracy on the whole dataset (train+valid) or on valid only, and I don't really remember why I ended up comparing only valid accuracies.
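The retrain-without-one-feature check described above can be sketched as follows. This is only an illustration under stated assumptions: a least-squares linear model stands in for the tabular learner, validation MSE stands in for accuracy, and the helper names (`fit_linear`, `drop_column_importance`) are made up for the example.

```python
import numpy as np

def fit_linear(X, y):
    # Ordinary least squares with an intercept column appended.
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    A = np.column_stack([X, np.ones(len(X))])
    return float(np.mean((A @ coef - y) ** 2))

def drop_column_importance(X_train, y_train, X_valid, y_valid, columns):
    """Retrain without each chosen column; return the increase in validation MSE."""
    base = mse(fit_linear(X_train, y_train), X_valid, y_valid)
    result = {}
    for j in columns:
        keep = [k for k in range(X_train.shape[1]) if k != j]
        coef = fit_linear(X_train[:, keep], y_train)
        result[j] = mse(coef, X_valid[:, keep], y_valid) - base
    return result

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
# Column 0 is strong, column 1 is weaker, columns 2 and 3 are pure noise.
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=2000)

# Fixed split so every retrain is scored on the same validation rows.
X_tr, X_va, y_tr, y_va = X[:1500], X[1500:], y[:1500], y[1500:]
importance = drop_column_importance(X_tr, y_tr, X_va, y_va, columns=[0, 1, 2, 3])
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```

The resulting ranking is what you would compare against the order in the shuffling-based table. Keeping the validation split fixed matters, because otherwise differences between runs mix feature effects with sampling noise.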

I also remembered one sanity check I made there. I created a fake feature (or, as I remember, there was one in the data) that the dependent variable does not, you know, depend on (for example, it has the same value for every row). After you apply your algorithm, its feature importance should be close to zero; if it's not, that is strong proof that something is wrong.
That's the only thing I can come up with: add some checks at each step, since we can't tell whether FI works just by looking at the output table (I can't :( ).
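The constant-feature sanity check might look like the sketch below. Shuffling a constant column leaves the data unchanged, so any nonzero importance points at the evaluation pipeline rather than the feature. The two evaluators are hypothetical: `noisy_eval` simulates some randomness left on at evaluation time and is not taken from fastai.

```python
import numpy as np

def constant_feature_check(evaluate, X, y, const_col, seed=0, tol=1e-12):
    """Shuffle a constant column and report (metric change, passed?)."""
    rng = np.random.default_rng(seed)
    assert np.all(X[:, const_col] == X[0, const_col]), "column is not constant"
    baseline = evaluate(X, y)
    X_shuffled = X.copy()
    X_shuffled[:, const_col] = rng.permutation(X_shuffled[:, const_col])
    delta = evaluate(X_shuffled, y) - baseline
    return delta, abs(delta) <= tol

weights = np.array([2.0, -1.0, 0.5])

def deterministic_eval(X, y):
    return float(np.mean((X @ weights - y) ** 2))

noise_rng = np.random.default_rng(123)

def noisy_eval(X, y):
    # Randomly zero inputs on every call: a stand-in for any randomness
    # that is accidentally active during evaluation.
    mask = noise_rng.random(X.shape) > 0.2
    return float(np.mean(((X * mask) @ weights - y) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
X[:, 2] = 1.0  # the fake constant feature
y = X @ weights + rng.normal(scale=0.1, size=500)

det_delta, det_ok = constant_feature_check(deterministic_eval, X, y, const_col=2)
noisy_delta, noisy_ok = constant_feature_check(noisy_eval, X, y, const_col=2)
print(det_ok, noisy_ok)
```

If the check fails, the size of the reported change also gives a rough noise floor: importances smaller than that are hard to distinguish from run-to-run variation.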
Hope that helps :)
Sorry for my very messy and unstructured thoughts and my ugly English (it's obviously not my first language). I know that in English I sound less polite than I should, but that's just an inherent feature of the Russian language :)


And there is one thing I did not quite understand: why are there two tables of FI?

Hi, have you made any progress there?
One more thought on this topic. Yesterday I was trying to implement a (naive) partial dependence for a text classification problem (substituting each word in a piece of text with unknown and monitoring how the probability of each class shifted). And I also noticed (along with many on this forum; there are a lot of topics about it, like this) that predicting a batch as a whole doesn't give the same result as predicting one by one (.predict() vs .get_preds()). It is weird. So I thought maybe this magic of substituting the dataloader, like learn.data.valid_dl = dt.train_dl, is not working as intended (.get_preds() uses something similar: .add_test() and substituting the data object, if I remember correctly). Just one crazy thought…

Sometimes it may skip one or two, but in my tests it operated the same with tabular data (no skipping found)… I haven't quite finished working on it yet, as I have a lot on my plate right now. I will get to it very soon though!

Maybe, when I find some time, I will compare your method of substituting the validation set in learn.data with manually applying .predict() to each row, to see whether they produce the same result…
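A comparison like that can be kept independent of the library. The sketch below assumes you can wrap the two prediction paths as plain callables returning NumPy-compatible outputs; `batch_buggy` is a made-up example of a batch path that behaves differently, not a description of what fastai does.

```python
import numpy as np

def compare_batch_and_rowwise(predict_batch, predict_row, X, atol=1e-6):
    """Return indices of rows whose batch and row-by-row predictions disagree."""
    batch_out = np.asarray(predict_batch(X))
    row_out = np.stack([np.asarray(predict_row(x)) for x in X])
    mismatch = ~np.isclose(batch_out, row_out, atol=atol)
    # Collapse any trailing output dimensions to one flag per row.
    return np.flatnonzero(mismatch.reshape(len(X), -1).any(axis=1))

w = np.array([0.5, -1.0, 2.0])

def batch_ok(X):
    return X @ w

def row_ok(x):
    return x @ w

def batch_buggy(X):
    # Hypothetical bug: the batch path standardises with batch statistics,
    # while the row path does not.
    return ((X - X.mean(axis=0)) / X.std(axis=0)) @ w

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, size=(8, 3))

same = compare_batch_and_rowwise(batch_ok, row_ok, X)
diff = compare_batch_and_rowwise(batch_buggy, row_ok, X)
print(len(same), len(diff))
```

Returning the mismatching row indices, rather than a single boolean, makes it easier to see whether the disagreement is everywhere or limited to a few rows.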


Here's a notebook where I explored just that :) (If I missed something in there, or if you see any mistakes, let me know. It's currently 5am and I haven't slept, so a weary once-over may have missed something.)


@Pak I’m wondering if it had to do with the databunch generation itself (I haven’t looked into this yet it’s just an idea). For instance if we make a non-split databunch, is the order made the same as our original dataframe?

Edit: Confirmed it does.

Edit 2: I see the issue, or a hint of one. Say I call learn.get_preds(), which goes to the validation dataset. We get an array of predictions, with the second item in that array being the c2i indexes. I am not seeing the same prediction being generated at all, despite the confidence being well above 80% in most cases. Part of that could be a difference in the model itself, but when I run learn.get_preds() multiple times I notice a large number of changes in those predictions. Meanwhile learn.predict() always gives me the same output.


Great point. I should definitely test my approach to applying a test set to the model the way you describe, to see whether it outputs consistent results (though as I remember I was checking it against learn.predict()).
Upd 1: I've tested it, and it doesn't; there must be some errors in my approach :( I will dive into it further.
Upd 2: Something weird has just happened. I was testing the difference between the results of .get_preds() and .predict() (there was one), and suddenly I started getting the same result out of nowhere. I can confirm that it's not a change in the code, because I did not edit it; I only appended new code to the notebook. Then I even reran my first experiments and they worked too. I have no idea what is going on. It looks like the library updated itself (the progress bar also started working); I did not do that manually. Maybe your case will magically start to work too :)
Upd 3: I've figured out why my approach stopped working. Fastai changed how it deals with the last layers, so I had to update my code too. Now I get the same results with all 3 functions: .get_preds(), .predict() and my own get_cust_preds().


@Pak see the discussion here:

Turns out I was missing a step!

Hi again.
I have managed to run some experiments with my Rossmann notebook (and updated it as well). I've noticed that you were probably right: the relative feature importance values (column permutation vs retrain methods) across features in my notebook are really comparable. I was confused by the absolute values, but once I normalize them the numbers tell a different story (which, by the way, I noticed only after I plotted the FI :( ).
My thoughts on this for now are the following:
Which to choose is a tricky question. On the one hand, the naive method (sorry, I will keep calling it that; I no longer think it's naive, but that's what it is called in my notebook, for historical reasons :) it is really column permutation) is waaaaaaaay faster. On the other hand, it depends on what you mean by the word importance.
If every feature were a separate entity, unrelated to the others, I would expect the results to be much more similar (and I would definitely recommend the naive method), but in real life that is hardly ever the case. In real life we have a mess of interconnected features (including derivative ones we created ourselves). And what do we really want to know? How our current model ranks features against one another relative to the dependent variable, or how much unique information a given feature holds? I say it depends. I see cases where the first option is better and some where the second is (at least those where we try to eliminate redundant features).
So I think these two methods just answer two slightly different questions about importance. Given enough time, I would probably use both of them to get some insights.
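The "two slightly different questions" point can be seen on synthetic data with a near-duplicate feature. This is a toy sketch, not the Rossmann notebook: it assumes a least-squares linear model and validation MSE, and normalises each method's scores to shares so they sit on one scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
a = rng.normal(size=n)
X = np.column_stack([
    a,                                   # feature 0
    a + rng.normal(scale=0.05, size=n),  # feature 1: near-duplicate of feature 0
    rng.normal(size=n),                  # feature 2: independent
])
y = X[:, 0] + X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=n)
X_tr, X_va, y_tr, y_va = X[:3000], X[3000:], y[:3000], y[3000:]

def fit(X, y):
    A = np.column_stack([X, np.ones(len(X))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def mse(coef, X, y):
    A = np.column_stack([X, np.ones(len(X))])
    return float(np.mean((A @ coef - y) ** 2))

coef = fit(X_tr, y_tr)
base = mse(coef, X_va, y_va)

perm, retrain = [], []
for j in range(X.shape[1]):
    # Permutation: same model, one validation column shuffled.
    X_perm = X_va.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    perm.append(mse(coef, X_perm, y_va) - base)
    # Retrain: a new model fitted without column j.
    keep = [k for k in range(X.shape[1]) if k != j]
    retrain.append(mse(fit(X_tr[:, keep], y_tr), X_va[:, keep], y_va) - base)

# Normalise to shares so the two methods can be compared directly.
perm_share = np.array(perm) / np.sum(perm)
retrain_share = np.array(retrain) / np.sum(retrain)
print(np.round(perm_share, 3))
print(np.round(retrain_share, 3))
```

Permutation gives the duplicated features a large share because the fitted model uses both of them, while retraining gives them almost none because each can be replaced by the other. Neither answer is wrong; they measure reliance and unique information respectively.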


@Pak thanks for being so thorough with this! I agree, both are doable, and the choice depends on the budget (money and time) you can allocate: the column permutation method looks at what the model relies on the most, whereas full retraining looks at what the model can find the most useful. Both are cut from the same cloth to some degree, and I agree that both could and should be done. Column permutation can also help explain the model's behavior quickly!

I’ll also drop this here. Terence Parr just released a new paper discussing Stratification Approach to Partial Dependence for Codependent Variables

Source code is here:

FYI, here’s how I got at least the model-agnostic KernelExplainer from shap to run in a notebook without errors on a tabular learner (model/data on gpu with both categorical and continuous variables):

# learn = tabular_learner(...)
# learn.fit_one_cycle(...)

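import numpy as np
import pandas as pd
import shap
import torch  # needed by pred() below; may already be in scope via fastai's star imports
shap.initjs()

def pred(data):
    # data is a 2-D numpy array: categorical codes first, continuous values last
    device = learn.data.device
    cat_cols = len(learn.data.train_ds.x.cat_names)
    cont_cols = len(learn.data.train_ds.x.cont_names)
    x_cat = torch.from_numpy(data[:, :cat_cols]).to(device, torch.int64)
    x_cont = torch.from_numpy(data[:, -cont_cols:]).to(device, torch.float32)
    # raw model outputs, moved back to the CPU as numpy for shap
    pred_proba = learn.model(x_cat, x_cont).detach().to('cpu').numpy()
    return pred_proba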

def shap_data(data):    
    X_train, y_train = data.one_batch(denorm=False, cpu=False)
    X_test, y_test = data.one_batch(denorm=False, cpu=False)
    cols = data.train_ds.x.col_names
    X_train = pd.DataFrame(np.concatenate([v.to('cpu').numpy() for v in X_train], axis=1), columns=cols)
    X_test = pd.DataFrame(np.concatenate([v.to('cpu').numpy() for v in X_test], axis=1), columns=cols)
    return X_train, X_test

X_train, X_test = shap_data(learn.data)
e = shap.KernelExplainer(pred, X_train)
shap_values = e.shap_values(X_test, nsamples=100, l1_reg=False)
shap.force_plot(e.expected_value[0], shap_values[0], X_test)

shap.summary_plot(shap_values, X_test, plot_type="bar")

This grabs two batches from the training set as X_train and X_test for shap.


Sadly, it looks like Google Colab doesn't support the JavaScript library :(

Have you seen this: https://github.com/slundberg/shap/issues/279#issuecomment-427240107? The JS should work; it just needs to be initialised in every cell that produces visual output.


Hi all. Has anyone gotten SHAP DeepExplainer to work with the FastAI Tabular DataBlock? It seems the formats SHAP expects are PyTorch primitives, which are of course different from the FastAI wrappers. SHAP seems to be a wonderful approach for getting some interpretability out of the NN. Or is there any upcoming extension to FastAI that provides this sort of functionality? It seems fairly easy to do with PyTorch and Keras, but if anyone has this working with FastAI please let me know. Thanks!

I’ve ported shap to fastai2: https://github.com/muellerzr/fastinference
