I’ve also worked on these problems (feature importance, partial dependence, etc.): here is a post and here is the notebook for the Rossmann data.
As I found out on my data, simple column permutation can be misleading.
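Just to be clear about terms, by “column permutation” I mean the usual permutation-importance procedure, roughly like this (a minimal sketch; `model`, `metric` and the data frame names are placeholders, not the actual code from my notebook):

```python
import numpy as np

# Minimal permutation-importance sketch: shuffle one column at a time and
# see how much the validation error grows compared to the baseline.
def permutation_importance(model, valid_df, y_valid, metric, cols):
    base_err = metric(y_valid, model.predict(valid_df))
    scores = {}
    for col in cols:
        shuffled = valid_df.copy()
        shuffled[col] = np.random.permutation(shuffled[col].values)
        scores[col] = metric(y_valid, model.predict(shuffled)) - base_err
    return scores  # larger increase in error = "more important" column
```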
Here is a quote from my notebook:
Wonderful, now that we know it (not really), we can move on to Partial Dependence (no).
The first thing that hinted to me that it’s not OK to do this with a NN with embeddings was the crazy difference in importance between features (in my other case it was even bigger).
I was looking at the data and could not believe that features which must be pretty important for the problem were at the bottom of the list (it was a case from a field I knew a lot about).
And then I noticed that pretty much all of the important features were categorical columns, and vice versa. And when I (using an editable install) shifted the max embedding size down to 10 (from up to 600), this gap became much smaller.
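For context on the 600: the embedding size per categorical column is picked by a rule of thumb that grows with cardinality; as far as I remember, the fastai default is roughly the following (treat the exact constants as an assumption, the point is the cap at 600):

```python
# fastai's rule of thumb for embedding width (from memory): grows with the
# number of categories and is capped at 600.
def emb_sz_rule(n_cat):
    return min(600, round(1.6 * n_cat ** 0.56))
```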
So it became pretty clear to me why embeddings (categorical columns) seem to be more valuable. Each continuous variable is represented by a single float, while each categorical one is represented by a vector of several dozen values. So when we permute a categorical column, we mess with tens of input columns rather than one, which, obviously, hurts accuracy more. What do we do? We will use the next (much more computationally expensive) option.
I sadly present to you the process, which involves retraining the NN for each column (or group of columns). The idea is very simple: we just throw away the column completely, retrain the NN, and compare the errors.
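A rough sketch of that drop-and-retrain loop might look like this (here `train_nn` is a hypothetical helper standing in for whatever builds and fits your tabular NN, and `metric` is your error function):

```python
# Drop-column importance sketch: retrain without each column and compare
# validation errors against the full-model baseline.
def drop_column_importance(train_df, y_train, valid_df, y_valid, metric, cols, train_nn):
    full_model = train_nn(train_df, y_train)
    base_err = metric(y_valid, full_model.predict(valid_df))
    scores = {}
    for col in cols:
        model = train_nn(train_df.drop(columns=[col]), y_train)
        err = metric(y_valid, model.predict(valid_df.drop(columns=[col])))
        scores[col] = err - base_err  # how much worse we do without this column
    return scores
```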
Maybe it can help.