When should I drop/ignore features when doing learning on tabular data?

JoshVarty · February 8, 2019, 12:22am

While going through v3 of the course, I’ve been looking at the Microsoft Malware Prediction Kaggle competition. The dataset is tabular and consists primarily of categorical data.

Some columns have millions of rows with one value and only a few hundred with another. For example, the boolean column AutoSampleOptIn has 8,921,225 rows where it is True and only 258 where it is False. For this reason I don’t expect to get much predictive value out of this column, and I feel like I should drop it from my training/test sets.
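To put a number on how lopsided a column like this is, here is a minimal sketch in plain Python. The helper names `dominant_share` and `near_constant_columns` and the 0.999 cutoff are my own illustrative assumptions, not something from the competition or the course. It only measures imbalance; it says nothing about whether the rare value is predictive.

```python
from collections import Counter


def dominant_share(values):
    """Return the most frequent value and the fraction of rows that hold it.

    Assumes `values` is non-empty.
    """
    counts = Counter(values)
    value, count = counts.most_common(1)[0]
    return value, count / sum(counts.values())


def near_constant_columns(columns, threshold=0.999):
    """Return names of columns whose most frequent value covers >= `threshold` of rows.

    `columns` maps a column name to its list of values. The default
    `threshold` is an arbitrary illustrative cutoff, not a recommended value.
    """
    flagged = []
    for name, values in columns.items():
        _, share = dominant_share(values)
        if share >= threshold:
            flagged.append(name)
    return flagged


# The counts quoted above for AutoSampleOptIn, expressed as a share.
true_count, false_count = 8_921_225, 258
share_true = true_count / (true_count + false_count)
print(f"{share_true:.5f}")  # about 0.99997
```

With pandas, `df[col].value_counts(normalize=True)` gives the same per-value shares for a single column.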

Are there rules of thumb or guidelines that suggest when I should drop a feature like this?