If features are correlated, then permutation importance can give biased results. In Interpretable Machine Learning, Christoph Molnar discusses this in the feature importance chapter, particularly in the disadvantages section.
Unless otherwise stated, I’d expect the assumption of feature independence to be a requirement in any method that involves holding some features constant while modifying others. TreeSHAP doesn’t always have this assumption, although it looks like certain output types do require setting feature_dependence="independent".
In regression analysis, feature independence (or in statistics terms: a lack of multicollinearity between independent variables, predictors, or covariates) is usually a required assumption as we are interested in interpreting the coefficients of the covariates. There are multiple methods for detecting multicollinearity, which we could use to check on our data.
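As a quick first pass, a pairwise correlation screen is easy to run. Here is a minimal sketch using pandas; the 0.8 threshold is just an illustrative rule of thumb, not a standard, and the helper name is my own:

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold.

    Only catches *pairwise* correlation; multicollinearity among three or
    more features can hide from this check.
    """
    corr = df.corr()
    cols = corr.columns
    return [
        (cols[i], cols[j], corr.iloc[i, j])
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
        if abs(corr.iloc[i, j]) > threshold
    ]
```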
Depending on the circumstances, multicollinearity isn’t always a problem. For example, through feature engineering or domain knowledge we might have a model whose inputs include age and age_squared, which by definition will be correlated with each other. In regression analysis we would always interpret the two coefficients together and never independently. For our tabular neural networks, we’d want to do the same, so perhaps we’d modify permutation importance to always permute age and age_squared together. The same goes for interaction terms and other combinations of features.
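That joint permutation could be sketched as follows. This is a hypothetical helper, not from any library; the model only needs a `predict` method, and `groups` says which column indices must be shuffled together:

```python
import numpy as np

def grouped_permutation_importance(model, X, y, score_fn, groups,
                                   n_repeats=10, seed=0):
    """Permutation importance that shuffles groups of related columns together.

    `groups` maps a name to a list of column indices that are permuted
    jointly, e.g. {"age": [0, 1]} for age and age_squared.
    Importance is the mean drop in score after permutation.
    """
    rng = np.random.default_rng(seed)
    baseline = score_fn(y, model.predict(X))
    importances = {}
    for name, cols in groups.items():
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            perm = rng.permutation(len(X))
            # Same row reordering for every column in the group, so the
            # relationship between age and age_squared stays intact.
            X_perm[:, cols] = X[perm][:, cols]
            drops.append(baseline - score_fn(y, model.predict(X_perm)))
        importances[name] = float(np.mean(drops))
    return importances
```

The key design choice is that one shared permutation is applied to all columns in a group, so permuting age and age_squared never produces impossible rows like age=3, age_squared=64.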
Small amounts of multicollinearity between features we want to be independent might not be completely problematic. The real world is messy, and practitioners don’t always have ideal data. Unfortunately, there are no hard and fast rules on what counts as acceptable multicollinearity, only various rules of thumb. For example, if we are modeling children’s health, then age, weight, and height are probably going to be correlated with each other. But if the correlation isn’t too large, we can still look at their feature importance, assuming we are careful in our reporting and interpretation and recognize and acknowledge that our results might be biased. Or, depending on the method used, we could treat them as control variables and limit our analysis to other features.
From a statistical practitioner’s perspective, if you want to interpret the feature importance of tabular neural networks I’d recommend this non-exhaustive list:
- Start by plotting a pairwise plot and a correlation matrix of all the data. This is more of an eyeball test for collinearity, as it can only reveal pairwise correlation, not multicollinearity.
- Normalize the data. Data normalization can remove certain types of collinearity. Keep in mind that domain knowledge might suggest something other than straight normalization. For example, when working with economics data the natural log of income is often more useful for interpretation than normalized income.
- Run at least one multicollinearity test, preferably multiple. A non-exhaustive list of options includes the variance inflation factor (VIF), the Farrar–Glauber test, perturbing the data, and the condition number test. Of these, only VIF appears to have a Python implementation in statsmodels; the rest have R packages. Be careful with VIF in statsmodels: by default it doesn’t include a constant term, so you’ll need to add another column to your data filled with ones.
- Remember that some forms of multicollinearity are not deal breakers if handled correctly. This depends on what type of collinearity is present and what type of feature importance analysis is being applied.
- Even if all the statistical tests look good, there could still be undetected multicollinearity, so always be careful when presenting results.
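For the VIF check in the list above, here is a minimal sketch with statsmodels. Note the extra constant column, which (as mentioned) statsmodels does not add for you; the helper name and the ~5–10 rule-of-thumb threshold are mine:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

def vif_report(X, names):
    """Compute the VIF for each feature in X (a 2-D array of columns).

    Values above roughly 5-10 are a common rule-of-thumb signal of
    multicollinearity, but there is no hard cutoff.
    """
    Xc = add_constant(X)  # statsmodels' VIF assumes a constant term is present
    # Index 0 is the constant itself, so offset by one and skip it.
    return {name: variance_inflation_factor(Xc, i + 1)
            for i, name in enumerate(names)}
```

On nearly collinear columns the reported VIFs blow up, while an independent column stays close to 1, which is what makes this a useful screen before trusting feature importances.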
Keep in mind that even with completely independent features, there are other factors that could bias feature interpretation. Some examples include omitted-variable bias and dealing with repeated measurements from a longitudinal study (measuring patients over time) or measurements made on clusters of related items (studying students in schools).
Any feature importance package, or add-on to fast.ai, should clearly state the assumption of feature independence if it is required.
Let me know if you have any questions or corrections.