Dropping Columns Based on Feature Importance

Hello,

I am going through today’s notebook, and I’d like to ask something to check whether I’ve understood it correctly.

Here are some statements I’ve tried to make; please correct me and/or add any missing points.

1) Dropping columns based on feature importance might get rid of collinear features.

2) Getting rid of collinear features will increase the importance of their collinear counterparts that are left in the model. Since both sets of features were giving similar signals, the importance in that direction was being divided among them.

3) After being left with purer features (like more orthogonal vectors), we can make better interpretations of each individual feature.

4) Dropping features may give us better results: being left with purer features (a better signal-to-noise ratio?) and having max_depth as a threshold, our model might use this new subset of features to generalize better thanks to its simplicity. (This part is clearer with the data-leakage example given during class: sometimes a single column can map the desired relationship, and additional features might just add noise. But if that were the case, wouldn’t the random forest stop at a single split on that feature?)

5) Open question: We often prefer simpler models for the sake of better generalizing a phenomenon. Does that mean dropping columns is always better when some features carry little signal, or should one squeeze every bit of information out of those features, without overfitting of course?
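Points (1)–(3) can be seen in a minimal sketch on hypothetical data: when a column is duplicated, a random forest shares the importance between the two copies, and dropping one copy concentrates the importance on the survivor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))
y = x[:, 0] + 0.1 * rng.normal(size=1000)

# Column 1 is an exact duplicate of column 0; column 2 is pure noise.
X = np.hstack([x, x, rng.normal(size=(1000, 1))])

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(rf.feature_importances_)   # signal importance shared between cols 0 and 1

# Drop one of the collinear copies; col 0 now carries nearly all the importance.
rf2 = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:, [0, 2]], y)
print(rf2.feature_importances_)
```

With the duplicate removed, the surviving column’s importance reflects the full signal, which makes its individual interpretation much cleaner.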

Thanks


Looking good! I’m not entirely sure about the answer to (4). I think it’s simply this: imagine you had 1000 columns and only one was useful, and you used 50% of the features for each split. Then you’ll often split on totally useless features. By removing the useless features you avoid these pointless splits. In practice it’ll hardly ever happen, and increasing the number of features considered at each split makes it even less likely, so we really see very little difference.

The latter. Don’t drop columns if doing so reduces your validation accuracy by an appreciable amount. If the simpler model were generalizing better, your validation accuracy would show it!
