Feature importance in deep learning

muellerzr · May 28, 2019, 7:41pm

Just going to tack onto this a little bit: I refactored the code so that now you just need to pass in the learner.

import numpy as np
import pandas as pd
import torch

def feature_importance(learner):
    # based on: https://medium.com/@mp.music93/neural-networks-feature-importance-with-fastai-5c393cf65815
    data = learner.data.train_ds.x
    cat_names = data.cat_names
    cont_names = data.cont_names
    # baseline loss on the unmodified validation set
    loss0 = np.array([learner.loss_func(learner.pred_batch(batch=(x, y.to("cpu"))), y.to("cpu")) for x, y in iter(learner.data.valid_dl)]).mean()
    fi = dict()
    types = [cat_names, cont_names]
    for j, t in enumerate(types):
        for i, c in enumerate(t):
            loss = []
            for x, y in iter(learner.data.valid_dl):
                col = x[j][:, i]    # x[0] holds the categorical variables, x[1] the continuous ones
                idx = torch.randperm(col.nelement())
                x[j][:, i] = col.view(-1)[idx].view(col.size())
                y = y.to('cpu')
                loss.append(learner.loss_func(learner.pred_batch(batch=(x, y)), y))
            fi[c] = np.array(loss).mean() - loss0
    d = sorted(fi.items(), key=lambda kv: kv[1], reverse=True)
    return pd.DataFrame({'cols': [l for l, v in d], 'imp': np.log1p([v for l, v in d])})

muellerzr · May 28, 2019, 8:14pm

@bernd.heidemann I’m just now taking a Udemy course on feature selection in hopes of getting a bit better at this. How would I go about handling two variables that are highly correlated with each other? Scramble both columns at the same time instead of one (double permutation)?

Thanks!!!
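
For anyone who wants to try the double-permutation idea, here is a minimal, model-agnostic sketch in plain Python. It is not fastai code: `toy_model`, `group_importance`, and the toy data are all hypothetical stand-ins, and the only assumption is that the model can be called on one row at a time. The point is that both columns of the group are moved with one shared permutation, so the pair stays internally consistent while its link to the target is broken.

```python
import random

def permute_columns_together(rows, col_indices, rng):
    """Return a copy of rows in which the given columns are shuffled with ONE shared permutation."""
    order = list(range(len(rows)))
    rng.shuffle(order)
    shuffled = [list(row) for row in rows]
    for new_pos, src in enumerate(order):
        for c in col_indices:
            shuffled[new_pos][c] = rows[src][c]
    return shuffled

def mse(model, rows, targets):
    """Mean squared error of a row-wise model."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def group_importance(model, rows, targets, col_indices, n_repeats=5, seed=0):
    """Average increase in loss after jointly permuting the columns in col_indices."""
    rng = random.Random(seed)
    base = mse(model, rows, targets)
    increases = []
    for _ in range(n_repeats):
        permuted = permute_columns_together(rows, col_indices, rng)
        increases.append(mse(model, permuted, targets) - base)
    return sum(increases) / n_repeats

# Toy data: column 1 is an exact copy of column 0, column 2 is unrelated noise.
data_rng = random.Random(42)
rows = []
for _ in range(200):
    a = data_rng.gauss(0, 1)
    rows.append([a, a, data_rng.gauss(0, 1)])
targets = [r[0] + r[1] for r in rows]

def toy_model(row):
    # Hypothetical stand-in for a trained network that uses both correlated columns.
    return row[0] + row[1]

single = group_importance(toy_model, rows, targets, [0])     # permute one column of the pair
pair = group_importance(toy_model, rows, targets, [0, 1])    # permute the pair together
noise = group_importance(toy_model, rows, targets, [2])      # permute the unused column
print(single, pair, noise)
```

In this toy setup the jointly permuted pair shows a larger loss increase than either column alone, and the unused column shows none. Whether that carries over to a real tabular network is exactly what would need testing on a well-known dataset.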

bernd.heidemann · May 28, 2019, 9:28pm

@muellerzr I have no idea whether double permutation will yield good results, but I would try it on a well-known dataset. Combined partial dependence plots might also be worth a try.

muellerzr · June 1, 2019, 4:46am

@bernd.heidemann Does your partial dependence code above still work? When I try it on the ADULT dataset, diff is always zero regardless of the variable or its type.

johnkeefe · June 24, 2019, 1:20am

@muellerzr Wanted to say thank you for this. It’s really fantastic.

muellerzr · June 24, 2019, 4:47am

@johnkeefe Absolutely! I also updated the code a bit more, as I wasn’t quite satisfied. This version has a progress bar showing which variable (out of the total number) you are on, along with another column for that particular variable’s type (as I found it quite confusing to go back and forth on that).

import numpy as np
import pandas as pd
import torch
from tqdm import tqdm
from fastai.basic_train import Learner

def feature_importance(learn:Learner):
    # based on: https://medium.com/@mp.music93/neural-networks-feature-importance-with-fastai-5c393cf65815
    data = learn.data.train_ds.x
    cat_names = data.cat_names
    cont_names = data.cont_names
    # baseline loss on the untouched validation set
    loss0 = np.array([learn.loss_func(learn.pred_batch(batch=(x, y.to("cpu"))), y.to("cpu")) for x, y in iter(learn.data.valid_dl)]).mean()
    fi = dict()
    types = [cat_names, cont_names]
    with tqdm(total=len(data.col_names)) as pbar:
        for j, t in enumerate(types): # j=0: categorical names, j=1: continuous names
            for i, c in enumerate(t):
                loss = []
                for x, y in iter(learn.data.valid_dl): # for each validation batch
                    col = x[j][:, i] # select one column of the batch
                    idx = torch.randperm(col.nelement()) # random permutation of the row indices
                    x[j][:, i] = col.view(-1)[idx].view(col.size()) # overwrite the column with its shuffled values
                    y = y.to('cpu')
                    loss.append(learn.loss_func(learn.pred_batch(batch=(x, y)), y))
                pbar.update(1)
                fi[c] = np.array(loss).mean() - loss0 # increase in loss caused by shuffling column c
    d = sorted(fi.items(), key=lambda kv: kv[1], reverse=True)

    df = pd.DataFrame({'Variable': [l for l, v in d], 'Importance': np.log1p([v for l, v in d])})
    # label each variable with its type (replaces the chained .iloc assignment)
    df['Type'] = ['categorical' if v in cat_names else 'continuous' for v in df['Variable']]
    return df

muellerzr · July 1, 2019, 3:03pm

One other important thing that I only realized through learning (we all learn together, right?): I was looking through the documentation on Permutation Importance, see here: https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html

"To avoid re-training the estimator we can remove a feature only from the test part of the dataset, and compute score without using this feature. "

And I realized we actually need to do this on the test set. My other posts have links to what a gradable (labeled) test set looks like; if anyone can’t find them, I’ll post one. I noticed a key difference in my feature selection: the values now make justifiable sense to me, and the importance losses are much more reasonable. My current solution is to pass in a labeled LabelList as test and, in the declaration of loss0, change iter(data.valid_dl) to iter(test.train_dl).

Overall the features were not different; however, it is better practice this way and more generally accepted.

However, I think I may move to implementing something like RFE as a comparison, in case there are relationships between certain variables.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn-feature-selection-rfe

Pak · July 1, 2019, 4:30pm

I’ve also worked on these problems (feature importance, partial dependence, etc.): here is the post and here is the notebook for the Rossmann data.
As I found out on my data, column permutation alone can be misleading.
Here is a quote from my notebook:

Wonderful, now we know it (not really) we can move on to Partial Dependence (no)

The first thing that hinted to me that it is not OK to do this with a NN with embeddings was the crazy difference in importance between features (in my other case it was even bigger).
I was looking at the data and could not believe that features which must be pretty important for the case were at the bottom of the list (the case was from a field I knew a lot about).
Then I noticed that nearly all of the important features were categorical columns, and vice versa. When I (using editable installs) reduced the max embedding size to 10 (from up to 600), this gap became much smaller.
So it became pretty clear to me why embeddings (categorical columns) seem to be more valuable. Each continuous variable is represented by 1 float. Each categorical variable is represented by a vector of several dozen floats. So when we randomize a categorical column, we mess with tens of columns rather than with one, which is obviously more harmful to accuracy.

What do we do? We will use the next (much more computationally expensive) option.
I sadly present to you the process that involves retraining the NN for each column (or group of columns).

The idea is very simple. We just throw away the column completely, retrain the NN and compare the errors

Maybe it can help
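
To make the retraining idea concrete, here is a minimal sketch of drop-column importance in plain Python. It is not the notebook's code: a tiny 1-nearest-neighbour regressor (`nn_predict`) stands in for the neural network so the example runs instantly, and all names and the toy data are hypothetical. With a real network, every call to `valid_mse` on a reduced dataset would be a full retraining run.

```python
import random

def nn_predict(train_x, train_y, row):
    """1-nearest-neighbour regressor; 'training' just means storing the data."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], row)))
    return train_y[best]

def valid_mse(train_x, train_y, valid_x, valid_y):
    """Fit on the training split and return the mean squared error on the validation split."""
    errs = [(nn_predict(train_x, train_y, r) - t) ** 2 for r, t in zip(valid_x, valid_y)]
    return sum(errs) / len(errs)

def drop_column(rows, col):
    """Return a copy of rows with column index col removed."""
    return [[v for j, v in enumerate(r) if j != col] for r in rows]

def drop_column_importance(train_x, train_y, valid_x, valid_y, names):
    """Error without the column minus error with all columns, for each column."""
    base = valid_mse(train_x, train_y, valid_x, valid_y)
    result = {}
    for col, name in enumerate(names):
        err = valid_mse(drop_column(train_x, col), train_y,
                        drop_column(valid_x, col), valid_y)
        result[name] = err - base
    return result

rng = random.Random(0)

def make(n):
    # Toy data: the target depends only on the first column; the second is pure noise.
    xs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    ys = [3.0 * x[0] for x in xs]
    return xs, ys

train_x, train_y = make(150)
valid_x, valid_y = make(50)
importance = drop_column_importance(train_x, train_y, valid_x, valid_y, ["signal", "noise"])
print(importance)
```

Because the column is removed from both training and validation data, the result does not depend on how many floats represent the column inside the model, which is the property that motivates this approach over permutation.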

muellerzr · July 1, 2019, 4:34pm

Thank you for the insight @Pak! Fascinating! I hadn’t noticed this, as my research uses entirely categorical data. Perhaps that makes a difference? The results I’m getting do make sense in the context of what I am studying. Perhaps if you only have one variable type or the other, it won’t matter as much. Thoughts?

muellerzr · July 1, 2019, 4:42pm

I also have some thoughts on your findings, let me try a few things.

Pak · July 1, 2019, 4:50pm

Yes, I think you are right. We are dealing here with the relative importance of different kinds of features, so if you use categorical variables only, it should work OK.
But after that finding I’ve pretty much lost faith in permutations, and now I believe more in retraining (and in multiple retrainings to be sure; it turned out to be not as computationally expensive as it seemed).

muellerzr · July 1, 2019, 4:56pm

I have a few ideas I want to compare. I’ll use the ADULT dataset just to be sure, as I’m not wholly convinced by the logic yet (just me being feisty), but I will let you know.

Pak · July 1, 2019, 5:01pm

Yes, that will be great. My thoughts are based on two datasets only.
One concern: I don’t know whether the adult dataset is wide enough (in terms of number of features) to explore the difference.
But it would be wonderful to know how it works out.

muellerzr · July 1, 2019, 5:03pm

Do you have a dataset that’s wide enough you’d recommend? Because ideally with the permutation importance done above, the only thing that should be replaced is the tensor mapping that is generated through the embedding matrix. And that is all. That is why I’m confused on the logic and want to see it myself.

Pak · July 1, 2019, 5:09pm

I really don’t know; maybe adult is good enough. I used the Rossmann data, but now I see that adult is much easier for a human to interpret as a sanity check, since Rossmann’s features are a mystery to a non-expert (I’m also a non-expert in sales).

Pak · July 1, 2019, 5:17pm

Yes, but when you replace an index in an embedding, the model gets not just one number (the index) but a vector of numbers (which represents this index, and it can be up to 600 floats per category), whereas when you replace a continuous variable you only mess with one number. And changing one number should, in general, have less effect than changing tens of numbers.
That’s the logic behind my code. Maybe you will get other results from your experiments, which could lead to other interpretations of what is really going on there that we can discuss.

muellerzr · July 1, 2019, 5:23pm

I believe it should be a look up value, so when we scramble the column we just scramble the one original input. Because all the embedding is doing is mapping a categorical variable to the separate vector of values but will still represent the one input value. So it shouldn’t matter. That is my take at least. I’ll be able to get to some small experiments later this evening too.
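
The two viewpoints in this exchange can be laid side by side with a tiny sketch in plain Python. The embedding table here is hypothetical (random numbers, 3 categories, 4 dimensions) and nothing is taken from fastai. It shows that shuffling the index column and then looking up gives exactly the same vectors as shuffling the looked-up vectors, so each row still carries one valid category. It also shows that the network's first layer sees `emb_dim` floats change per row rather than one.

```python
import random

emb_dim = 4
rng = random.Random(0)

# Hypothetical embedding table: one row of emb_dim floats per category index.
embedding = [[rng.gauss(0, 1) for _ in range(emb_dim)] for _ in range(3)]

def lookup(indices):
    """Map each category index to its embedding vector."""
    return [embedding[i] for i in indices]

cat_column = [0, 2, 1, 1, 0, 2]      # one integer per example
perm = list(range(len(cat_column)))
rng.shuffle(perm)                    # the permutation used to scramble the column

# View 1: scramble the single input column, then embed.
shuffled_column = [cat_column[p] for p in perm]
vectors_after_shuffle = lookup(shuffled_column)

# View 2: embed first, then scramble whole vectors with the same permutation.
original_vectors = lookup(cat_column)
shuffled_vectors = [original_vectors[p] for p in perm]

same = vectors_after_shuffle == shuffled_vectors
floats_per_row = len(vectors_after_shuffle[0])
print(same, floats_per_row)
```

So the scrambled information is still one value per row, as argued above, while the number of downstream inputs that move at once is `emb_dim`. Which of these two effects dominates the measured importance is an empirical question this sketch cannot answer.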

Pak · July 1, 2019, 6:19pm

Yes, now I think I’ve got your idea: all the values in a vector represent one initial value (index/category), so the amount of initial information is still one value.
It will be interesting to see how it goes.

muellerzr · July 1, 2019, 7:21pm

Exactly! The exception is when we deal with missing values, as FillMissing adds a binary ‘_na’ variable, but otherwise, exactly. I’m trying to find a particular article at the moment that should help clarify more for both of us (I found it a while ago and it’s buried somewhere).

muellerzr · July 3, 2019, 3:39pm

@Pak, I’m working on the experimentation now, and I realized something. I know we use the loss function when we calculate importance, but considering this is (for the most part) used with classification, would it not be better to show our metric instead (e.g. accuracy), since we can then explain why it works a bit more easily and directly?

In retrospect, why not both, I think.
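
A minimal sketch of the "why not both" version, in plain Python rather than fastai: for each column it reports the increase in log loss and the drop in accuracy after permutation. `toy_classifier`, `permutation_report`, and the toy data are hypothetical stand-ins, and the classifier is assumed to return the probability of class 1 for a single row.

```python
import math
import random

def log_loss(probs, labels):
    """Binary cross-entropy, clipped to avoid log(0)."""
    eps = 1e-12
    total = sum(math.log(max(p if y == 1 else 1 - p, eps)) for p, y in zip(probs, labels))
    return -total / len(labels)

def accuracy(probs, labels):
    """Fraction of rows where thresholding at 0.5 matches the label."""
    return sum((p >= 0.5) == (y == 1) for p, y in zip(probs, labels)) / len(labels)

def permutation_report(predict_proba, rows, labels, names, seed=0):
    """For each column, report loss increase and accuracy drop after shuffling it."""
    rng = random.Random(seed)
    base_probs = [predict_proba(r) for r in rows]
    base_loss = log_loss(base_probs, labels)
    base_acc = accuracy(base_probs, labels)
    report = {}
    for col, name in enumerate(names):
        values = [r[col] for r in rows]
        rng.shuffle(values)
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, values)]
        probs = [predict_proba(r) for r in permuted]
        report[name] = {
            "loss_increase": log_loss(probs, labels) - base_loss,
            "accuracy_drop": base_acc - accuracy(probs, labels),
        }
    return report

# Toy data: the label depends only on the first column.
data_rng = random.Random(1)
rows = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(300)]
labels = [1 if r[0] > 0 else 0 for r in rows]

def toy_classifier(row):
    # Hypothetical stand-in for a trained classifier; it ignores the second column.
    return 1 / (1 + math.exp(-4 * row[0]))

report = permutation_report(toy_classifier, rows, labels, ["useful", "ignored"])
print(report)
```

The accuracy drop is easier to explain to a reader, while the loss increase also registers changes in confidence that do not flip a prediction, so showing both columns loses nothing.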