Approaches to test stability of feature selection


I am working with tabular data and looking for ways to test feature stability in a model. The model is used in a banking application, so it needs consistently explainable features (i.e., two customers with similar characteristics should get the same top features from the model). Features are selected automatically using a wrapper method.

How would I go about ensuring that the features selected for the model are stable? One approach I can think of is k-fold cross-validation (selecting features separately for each fold) to assess whether I get the same set of features for each fold as for the overall model. However, I am not sure this test is conclusive either way, and it will be extremely demanding computationally.
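For concreteness, that per-fold check could be sketched like this. Everything here is a toy stand-in: the data is simulated, and a simple absolute-correlation ranking plays the role of the (much more expensive) wrapper selector; the pairwise Jaccard similarity between the per-fold feature sets is the stability score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1000 rows, 20 features; only the first 3 drive the target.
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] - 0.5 * X[:, 2]
     + rng.normal(scale=0.5, size=1000)) > 0

def select_top_k(X, y, k=5):
    # Stand-in for the wrapper method: rank features by absolute
    # correlation with the target and keep the top k.
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return set(np.argsort(scores)[-k:])

# Manual k-fold split; run the selector independently on each fold.
k_folds = 5
idx = rng.permutation(len(X))
folds = np.array_split(idx, k_folds)
selected = [select_top_k(X[f], y[f]) for f in folds]

def jaccard(a, b):
    # 1.0 = identical feature sets; values near 0 = unstable selection.
    return len(a & b) / len(a | b)

sims = [jaccard(selected[i], selected[j])
        for i in range(k_folds) for j in range(i + 1, k_folds)]
print("mean pairwise Jaccard:", round(float(np.mean(sims)), 3))
```

With a real wrapper method you would only swap out `select_top_k`; the Jaccard aggregation at the end stays the same.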

Are there other, better approaches to test this? Any practical suggestions would be very helpful. Thanks in advance!

I think it all depends on what you mean by stability.
I agree that one way to look at this is whether building a new model on different samples of your data would always produce (more or less) the same order of features by importance (I assume we are not in the NN domain here, but rather in tree ensembles).
I don't think this would be super computationally demanding. It is a one-shot exercise and can go pretty fast: training ~100 XGBoost models on the same number of randomly sampled datasets (~10k rows, ~100 features) should not take more than a few minutes.
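A rough sketch of that exercise, with assumptions flagged: the data is simulated, a correlation-based proxy replaces the actual model's `feature_importances_` (swap in XGBoost on your data), and rank agreement across resamples is measured with a Spearman correlation between importance rankings.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy setup: 10k rows, 20 features (not 100, to keep the sketch fast).
n_rows, n_feats, n_models = 10_000, 20, 50
X = rng.normal(size=(n_rows, n_feats))
y = (X[:, 0] + 0.5 * X[:, 1] - 0.3 * X[:, 2]
     + rng.normal(scale=0.5, size=n_rows))

def importance(X, y):
    # Proxy importance: absolute correlation with the target.
    # In practice, fit your model here and read its importances.
    return np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# One "model" per random resample; record the importance ranking.
rankings = []
for _ in range(n_models):
    rows = rng.choice(n_rows, size=n_rows, replace=True)  # bootstrap sample
    imp = importance(X[rows], y[rows])
    rankings.append(np.argsort(np.argsort(-imp)))  # rank 0 = most important

# Spearman rho = Pearson correlation on the rank vectors.
# Values near 1 mean the importance order is stable across resamples.
R = np.array(rankings, dtype=float)
rhos = [np.corrcoef(R[0], R[i])[0, 1] for i in range(1, n_models)]
print("mean Spearman rho vs. first model:", round(float(np.mean(rhos)), 3))
```

If the mean rho (or the agreement among the top-N features) is low, the importance order is sample-dependent and the explanations will not be consistent across similar customers.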

Another way to look at it, which is model-agnostic, is to check whether your features are densely populated and, most importantly, whether they are ~uniformly populated over time. I don't know which project you are working on exactly, but for almost anything you'd have a timestamp associated with the event you are modelling. Say you are in credit scoring and use average salary to predict probability of default: you really want to make sure the salary feature is populated in a stable way across time, e.g. the average salary of credit applications in February should not differ much from that in April. If you notice large shifts, then your feature is not "stable".
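One common way to quantify that kind of shift is the Population Stability Index (PSI) between two time slices of the same feature. The data below is simulated (a deliberately drifted "April" batch), and the usual rule-of-thumb thresholds are just conventions, not hard rules.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical monthly snapshots of one feature (e.g. average salary);
# the April batch is deliberately shifted so the check fires.
feb = rng.normal(loc=3000, scale=500, size=2000)
apr = rng.normal(loc=3600, scale=500, size=2000)  # distribution drifted

def psi(expected, actual, n_bins=10):
    # Population Stability Index between two samples of one feature.
    # Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 large.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside Feb's range
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

print("PSI Feb vs Apr:", round(psi(feb, apr), 3))
```

The same function works for fill rate over time if you feed it the missingness indicator instead of the raw values.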

This is a separate issue, though. A very important one, but a little different from stability.