Hi all, I’d like to share an elegant trick I’ve come across recently - a simple generalization of standard permutation importance to feature interactions.
Permutation importance for a single feature was discussed in ML course. At the end of lesson 6 there was a question about extending it to interaction importance without a particularly satisfying answer - this is the gap I hope to fill in this post.
Say, we want to measure the importance of the interaction between features x_i and x_j in a model in terms of its effect on some evaluation metric, such as AUC. Let’s measure the AUC on original validation data, as well as AUC_i = AUC on the validation data with shuffled x_i, AUC_j - with shuffled x_j and AUC_ij - with both features shuffled.
Key observation: when we shuffle a feature we are killing not only the feature itself, but also the interaction.
Writing that out in algebraic terms:
AUC_i = AUC - imp_i - imp_ij
AUC_j = AUC - imp_j - imp_ij
AUC_ij = AUC - imp_i - imp_j - imp_ij
And solving for imp_ij:
imp_ij = AUC - AUC_i - AUC_j + AUC_ij
That’s it. A simple black-box method just like the original random shuffle importance - it doesn’t need to know anything about the model’s internals, only needs the ability to evaluate the model on modified data.
Footnote: strictly speaking the first formulas should also have terms for the interactions with other features, but they’ll cancel out in the end.
I like the simplicity of this. I think the key to its implementation on most datasets with more than just a handful of features is to figure out some way to filter down pairs of features to permute simultaneously. Otherwise, the search space can be too large if p + (p choose 2) permutation/predictions are required (here I have denoted p as the number of features in the model). Perhaps first permute the p features independently, rank these independent importances, and then focus on only permuting the top k of these p features pairwise, where k << p.
Yes as it’s testing each pair of features independently it could be kind of slow with hundreds of features.
You might want then to prefer methods that inspect the model’s internals directly, e.g. xgbfir. The implementation is messier and results are less interpretable as they report them in terms of internal model properties like information gain, split counts rather than something you actually care about - AUC, accuracy or some other evaluation metric. But it could useful as an initial filtering step. Another way to filter could be to build a new ensemble with just depth=2 trees, check what interactions get picked the most by it and then test those on the full model.
If you’re using random forests rather than boosting and testing on OOB for each tree, then you only need to test for each tree the pairs of features that appear in the tree (just like it should be done for single feature importance testing per Breiman’s original RF paper). With limited tree’s depth this should cut down number of pairs to test significantly.
Another model-agnostic interaction importance method I’ve seen is Friedman’s H-statistic, but that one is quadratic in dataset size, whereas mine is quadratic only in features count.