Find the features that most influence an outcome

Hi everyone,

I have an outcome that depends on ~1000 features, and I was wondering if there is a standard/optimal way to find the features that most impact this outcome.

I thought about fitting interpretable models such as decision trees or random forests, but I wonder if it’s right to use machine learning for this, because I think a feature can have a lot of influence without necessarily being a good predictor (for example, if a feature generates a lot of randomness in the output when it takes a certain value).

Does anyone have any advice or pointers to resources on this topic?

Thanks

Look into permutation importance (there are a number of posts on this). It works as follows:

  1. Train your NN
  2. On your validation set, shuffle one column to ‘remove’ whatever information it carries about the outcome
  3. Re-evaluate the model and compare its score to the original (unshuffled) baseline, then repeat for each column

This gives you the % difference in your metric for each feature, which lets you narrow down which features your NN relies on the most.
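
The steps above can be sketched roughly as follows. This is only a minimal, library-free illustration, not a reference implementation: `permutation_importance`, `neg_mse`, `toy_predict`, and the toy data are all hypothetical names made up for the example, and `toy_predict` stands in for whatever trained model you actually have. It assumes a metric where higher is better and reports the raw drop in score rather than a percentage.

```python
import random


def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Mean drop in `metric` when each column of X is shuffled.

    predict: callable taking a list of rows and returning predictions
    X:       validation rows, each a list of feature values
    y:       validation targets
    metric:  callable (y_true, y_pred) -> score, higher is better
    """
    rng = random.Random(seed)
    baseline = metric(y, predict(X))  # score on the untouched data
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other column intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - metric(y, predict(X_perm)))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


def neg_mse(y_true, y_pred):
    # Negative mean squared error, so that higher is better.
    return -sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def toy_predict(rows):
    # Stand-in for a trained model: it only ever looks at feature 0.
    return [2.0 * row[0] for row in rows]


# Toy validation set with two features; the target depends on feature 0 only.
data_rng = random.Random(1)
X_val = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y_val = [2.0 * row[0] for row in X_val]

baseline, importances = permutation_importance(toy_predict, X_val, y_val, neg_mse)
for j, drop in enumerate(importances):
    print(f"feature {j}: mean drop in score = {drop:.4f}")
```

In this toy setup, shuffling feature 0 hurts the score while shuffling feature 1 changes nothing, because the stand-in model never reads it. With ~1000 features the loop costs one model evaluation per feature per repeat, so you may want a small `n_repeats`. If you happen to use scikit-learn, `sklearn.inspection.permutation_importance` offers the same idea for fitted estimators.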


@muellerzr Thanks a lot!