Find the features that most influence an outcome

sebderhy · January 18, 2020, 2:55pm

Hi everyone,

I have an outcome that depends on ~1000 features, and I was wondering if there is a standard/optimal way to find the features that most impact this outcome.

I thought about fitting interpretable models such as decision trees or random forests, but I wonder if it’s right to use machine learning for this, because I think a feature can have a lot of influence without necessarily being a good predictor (for example, if a certain value of the feature introduces a lot of randomness into the output).

Does anyone have any advice or pointer to resources regarding this topic?

Thanks

muellerzr · January 18, 2020, 3:07pm

Look into permutation importance (there are a number of posts on this). It works as follows:

1. Train your NN
2. On your validation set, shuffle one column at a time to ‘remove’ any link between that feature and the outcome
3. Compare the resulting validation metric to the original (unshuffled) baseline

This tells you the % change in your validation metric for each feature, which lets you narrow down the features your NN relies on most.
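
The procedure above can be sketched in a few lines of plain Python. This is only a minimal illustration under some assumptions: `toy_predict` is a hypothetical stand-in for a trained model, the data is synthetic, the metric is mean squared error, and the names `permutation_importance`, `n_repeats` and `seed` are made up for this example. It reports the absolute increase in error rather than a % difference, because the toy baseline error is exactly zero.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in validation error when each column of X is shuffled.

    predict: callable taking a list of rows and returning a list of predictions
    X: list of rows (validation features), y: list of targets
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(y, predict(X))
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only this column; every other column stays intact.
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            error = mean_squared_error(y, predict(X_perm))
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return baseline, importances


# Hypothetical "trained model": depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2.
def toy_predict(rows):
    return [3.0 * r[0] + 0.5 * r[1] for r in rows]


# Synthetic validation set (assumption: 200 rows, 3 Gaussian features).
data_rng = random.Random(42)
X_valid = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y_valid = [3.0 * r[0] + 0.5 * r[1] for r in X_valid]

baseline, importances = permutation_importance(toy_predict, X_valid, y_valid)
for i, imp in enumerate(importances):
    print(f"feature {i}: error increase = {imp:.3f}")
```

Features whose shuffling barely changes the error are ones the model is not using. With ~1000 features this loop makes one validation pass per feature per repeat, so it may be worth lowering `n_repeats` or subsampling the validation set.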

sebderhy · January 22, 2020, 4:10pm

@muellerzr Thanks a lot!