While going through v3 of the course, I’ve been looking at the Microsoft Malware Detection Kaggle competition. The dataset is tabular and involves primarily categorical data.
Some columns have literally millions of examples of one value, with hundreds of another. For example, for the boolean column AutoSampleOptIn
we have 8,921,225
examples of True
and only 258
examples where it is False
. For this reason I don’t expect to get much predictive value out of this column and feel like I should drop it from my training/test sets.
Is there a rule of thumb or guidelines that suggest when I should drop a feature like this?