Feature importance in deep learning

@johnkeefe absolutely! I also updated the code a bit more, as I wasn’t quite satisfied. This one now has a progress bar so you know which variable (out of the total) you are at, along with another column for each variable’s type (as this was something I found quite confusing to go back and forth on).

import numpy as np
import pandas as pd
import torch
from tqdm import tqdm
from fastai.tabular import *  # fastai v1: Learner and the tabular data API

def feature_importance(learn:Learner):
    # based on: https://medium.com/@mp.music93/neural-networks-feature-importance-with-fastai-5c393cf65815
    data = learn.data.train_ds.x
    cat_names = data.cat_names
    cont_names = data.cont_names
    # baseline loss on the (unshuffled) validation set
    loss0 = np.array([learn.loss_func(learn.pred_batch(batch=(x, y.to("cpu"))), y.to("cpu"))
                      for x, y in iter(learn.data.valid_dl)]).mean()
    fi = dict()
    types = [cat_names, cont_names]
    with tqdm(total=len(data.col_names)) as pbar:
        for j, t in enumerate(types):      # j=0: categorical tensor, j=1: continuous tensor
            for i, c in enumerate(t):      # i-th column of that tensor, named c
                loss = []
                for x, y in iter(learn.data.valid_dl):   # whole validation set, batch by batch
                    col = x[j][:, i]                          # the column to permute
                    idx = torch.randperm(col.nelement())      # random permutation of its row indices
                    x[j][:, i] = col.view(-1)[idx].view(col.size())  # shuffle the column in place
                    y = y.to('cpu')
                    loss.append(learn.loss_func(learn.pred_batch(batch=(x, y)), y))
                pbar.update(1)
                fi[c] = np.array(loss).mean() - loss0     # importance = increase in loss after shuffling
    d = sorted(fi.items(), key=lambda kv: kv[1], reverse=True)

    df = pd.DataFrame({'Variable': [l for l, v in d],
                       'Importance': np.log1p([v for l, v in d])})
    df['Type'] = ['categorical' if v in cat_names else 'continuous' for v in df['Variable']]
    return df
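
To use it, something like the following should work (assuming learn is an already-trained fastai v1 tabular Learner; the barh plot is just one way to look at the result):

fi = feature_importance(learn)      # learn: a trained tabular Learner
print(fi.head(10))                  # top-10 variables by (log1p of) permutation importance
fi.plot.barh(x='Variable', y='Importance', figsize=(8, 10))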


One other important thing that I only realized through learning (we all learn together, right?): I was looking through the documentation on permutation importance, see here: https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html

"To avoid re-training the estimator we can remove a feature only from the test part of the dataset, and compute score without using this feature. "

And I realized we actually need to do this on the test set. My other posts have links to what a gradable test set looks like; if anyone can’t find it I’ll post one. I noticed a very key difference in my feature selection: the values now actually make justifiable sense to me, and the importances take on much more reasonable loss values. The solution I am using now is to pass in a labeled LabelList as test, and in the declaration for loss0 change iter(data.valid_dl) to iter(test.train_dl).
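
Roughly, the only thing that changes is which DataLoader the losses are averaged over. A sketch of the idea (the baseline_loss helper and the test_ll name are just illustrations, not part of the function above):

import numpy as np

def baseline_loss(learn, dl):
    # average the loss over whichever DataLoader we want to score on
    return np.array([learn.loss_func(learn.pred_batch(batch=(x, y.to('cpu'))), y.to('cpu'))
                     for x, y in iter(dl)]).mean()

# loss0 = baseline_loss(learn, learn.data.valid_dl)   # original: validation set
# loss0 = baseline_loss(learn, test_ll.train_dl)      # preferred: a labeled test LabelList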

Overall the selected features were not different; however, it is better practice this way and generally more accepted.

However, I think I may move to implementing something like RFE as a comparison, in case there are relations between certain variables:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn-feature-selection-rfe
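
For anyone curious, scikit-learn’s RFE needs an estimator that exposes coef_ or feature_importances_, so it won’t wrap the fastai learner directly. A small standalone sketch with a random forest as the ranking estimator (the toy data here is made up):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# toy tabular data standing in for a real dataset
X, y = make_classification(n_samples=1000, n_features=12, n_informative=4, random_state=0)

# recursively drop the weakest feature until 5 remain
selector = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
               n_features_to_select=5, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = selected; higher = eliminated earlier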

I’ve also worked with these problems (feature importance, partial dependence, etc.): here is the post and here is the notebook for the Rossmann data.
As I found out on my data, simple column permutation can be misleading.
Here is a quote from my notebook:

Wonderful, now we know it (not really) we can move on to Partial Dependence (no)

The first point that hinted to me that it is not OK to do this with a NN with embeddings was the crazy difference in importance between features (in my other case it was even bigger).
I was looking at the data and could not believe that features which must be pretty important for the case were at the bottom of the list (it was a case from a field I knew a lot about).
And then I noticed that pretty much all the important features were categorical columns, and vice versa. And when I (using an editable install) shifted the max embedding size to 10 (from up to 600), this gap became much smaller.
So it became pretty clear to me why embeddings (categorical columns) seem to be more valuable: each continuous variable is represented by a single float, while each categorical variable is represented by a vector of several dozen. So when we randomize a categorical column, we mess with tens of (embedding) columns rather than one, which, obviously, is more harmful for accuracy.
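
To make this concrete, a toy illustration (the sizes are invented): shuffling a continuous column moves one float per row, while shuffling a categorical index swaps out the entire embedding vector the network actually sees:

import torch
import torch.nn as nn

emb = nn.Embedding(600, 60)            # a high-cardinality category with a 60-dim embedding
cat_idx = torch.randint(0, 600, (8,))  # one batch of category indices
cont_col = torch.randn(8, 1)           # one continuous column

print(emb(cat_idx).shape)  # torch.Size([8, 60]): permuting the indices scrambles 60 inputs per row
print(cont_col.shape)      # torch.Size([8, 1]):  permuting this scrambles only 1 input per row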

What do we do? We will use the next (much more computationally expensive) option.
I sadly present to you the process which involves retraining the NN for each column (or group of columns).

The idea is very simple: we just throw away the column completely, retrain the NN, and compare the errors.
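
A rough sketch of that loop, assuming hypothetical helpers build_and_train(df, dep_var) and score(learn) that wrap the usual tabular setup (this is not the actual notebook code, just the shape of the idea):

import numpy as np

def retrain_importance(df, dep_var, features, build_and_train, score, n_runs=5):
    # baseline: train on all features several times and take the median score for stability
    base = np.median([score(build_and_train(df, dep_var)) for _ in range(n_runs)])
    importance = {}
    for col in features:
        reduced = df.drop(columns=[col])                      # throw the column away completely
        runs = [score(build_and_train(reduced, dep_var)) for _ in range(n_runs)]
        importance[col] = base - np.median(runs)              # drop in score = importance of the column
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)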

Maybe it can help


Thank you for the insight @Pak! Fascinating! I hadn’t noticed this, as my research data is entirely categorical. Perhaps that makes a difference, as the results I’m getting do make sense in the context of what I am studying? Perhaps it’s an isolated incident where, if you only have one variable type or the other, it won’t matter as much. Thoughts?

I also have some thoughts on your findings, let me try a few things.

Yes, I think you are right. We are dealing here with the relative importance of different kinds of features, so if you use categorical features only, it should work OK.
But after that finding :slight_smile: I’ve pretty much lost faith in permutations, and now I believe more in retraining (and in multiple retrainings for certainty; it turned out not to be as computationally expensive as it seemed).

I have a few ideas I want to compare. I’ll do the ADULT dataset just to be sure, as I’m not wholly convinced by the logic yet (just me being feisty) :wink: but I will let you know.

Yes, that would be great. My thoughts are based on two datasets only.
One concern: I don’t know if the ADULT dataset is wide enough (in terms of number of features) to explore the difference.
But it would be wonderful to know how it works out.

Do you have a dataset you’d recommend that is wide enough? Because ideally, with the permutation importance done above, the only thing that should be replaced is the tensor mapping that is generated through the embedding matrix, and that is all. That is why I’m confused by the logic and want to see it for myself.

I really don’t know; maybe ADULT is good enough. I used the Rossmann data, but now I see how ADULT is much more interpretable by a human for a sanity check, as Rossmann’s features are a mystery to a non-expert (I’m also a non-expert in sales :slight_smile: ).

Yes, but when you replace an index in an embedding, the model gets not just one number (the index) but a vector of numbers (which represents this index and can be up to 600 floats per category), whereas when you replace a continuous variable you only mess with one number. And changing one number should, in general, affect the result less than changing tens of numbers.
That’s the logic behind my code. Maybe you will get other results from your experiments, and that could lead to other interpretations of what is really going on there, which we can discuss.


I believe it should be a lookup value, so when we scramble the column we just scramble the one original input. All the embedding is doing is mapping a categorical variable to a separate vector of values, but that vector still represents the one input value, so it shouldn’t matter. That is my take at least. I’ll be able to get to some small experiments later this evening too.


Yes, now I think I’ve got your idea: all the values in the vector represent one initial value (index/category), so the amount of initial information is still one value.
It will be interesting to see how it goes.


Exactly! The exception is when we deal with missing values, as FillMissing maps a binary '_na' variable, but otherwise exactly. I’m working on finding a particular article at the moment that should help clarify things more for both of us :slight_smile: (I found it a while ago and it’s buried somewhere)

@Pak, I’m working on the experimentation now, and I realized something. I know we use the loss function when we calculate it, but considering this is (for the most part) used with classification, would it not be better to show our metric instead (e.g. accuracy)? That way we can explain why it works a bit more easily and directly.

In retrospect, why not both, I think.

It was quite some time ago that I experimented with tabular data, and maybe I don’t really remember all the things I tried. But after looking at my code I noticed that I had thought about something like that, as func is a parameter in my main function calc_feat_importance( ... , func=exp_rmspe, ...).
In fact, as Rossmann is not a classification task and we predict the number of sales, my exp_rmspe is like an accuracy metric, so I don’t use the loss function as the measure for feature importance.
And the same is true for my partial dependence analysis.

And yes, I agree: I think accuracy will be better for most cases of calculating feature importance.
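
A small sketch of what a pluggable scoring function could look like (the helper name and details here are illustrative, loosely echoing the func=exp_rmspe parameter above, not the actual calc_feat_importance code):

import numpy as np
from fastai.metrics import accuracy  # fastai v1 metric

def score_batches(learn, dl, func=accuracy):
    # average a metric (e.g. accuracy) instead of the raw loss over a DataLoader
    vals = [func(learn.pred_batch(batch=(x, y.to('cpu'))), y.to('cpu')) for x, y in iter(dl)]
    return np.array(vals).mean()

# the importance of a shuffled column would then be baseline_score - shuffled_score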


By the way, here are some of my experiments from the Rossmann notebook.
Top-10 features; the higher the value, the more important the feature.

The standard approach (with column shuffling):

('Store', 2.7752134208337362),
('Promo', 0.8845676767547609),
('DayOfWeek', 0.43007899025606494),
('State', 0.2258095143853605),
('StoreType', 0.16391973889960224),
('Day', 0.15572013064943885),
('CompetitionDistance', 0.1554957910318926),
('Promo_fw', 0.15135729285537958),
('Promo_bw', 0.11064122675124458),
('Week', 0.09965946372048601)

No continuous variables near the top, and the Store variable, which has a very high cardinality (as it is closest to uniqueness), is rated too high.

And here is the retrained version (importance calculated by throwing away a column, retraining, and comparing the results):

('Store', 0.135575061499906),
('DayOfWeek', 0.03104379735704198),
('Max_TemperatureC', 0.029381313915677092),
('Max_Wind_SpeedKm_h', 0.0258247753326603),
('State', 0.02553913667452366),
('Promo2SinceYear', 0.02393384137479487),
('PromoInterval', 0.02128868845742942),
('BeforeStateHoliday', 0.019680976757349807),
('trend_DE', 0.018117666413716534),
('CompetitionOpenSinceYear', 0.01799674571282753)

Store is still first, but not as overrated, and we can see continuous variables here…

Disregard the deleted post. I realized that I need to link the _na and the categorical column together. Give me some more time please :slight_smile: The above is wrong; I need to rework some things.

Specifically, the issue with the original and with my reiteration is that when we shuffle a column that has missing values, we need to shuffle that column and its _na column in the exact same way.
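
In code, the fix amounts to reusing one permutation for both columns. A sketch of the idea inside the shuffle loop (i_cont and i_na stand for the positions of the variable and of its matching _na flag, which would have to be looked up; x[0]/x[1] follow the cat/cont ordering used in the function above):

idx = torch.randperm(x[1][:, i_cont].nelement())   # one shared permutation for the batch
x[1][:, i_cont] = x[1][:, i_cont][idx]             # shuffle the continuous variable itself
x[0][:, i_na] = x[0][:, i_na][idx]                 # shuffle its _na flag in exactly the same way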


I will definitely check your notebook when I get to a PC. One last thing I forgot to mention is that the retraining method is pretty unstable (as it retrains with different initializations and produces somewhat different accuracies). So I repeated the retraining (in fact, I also repeated the base training for stability) a number of times (five, as I remember) and took the median accuracy to get more stable results.

Hmmm, I don’t remember whether I thought about that; maybe I have some errors like this too.