Advice on Microsoft Malware Detection Kaggle competition

I’m trying to use fastai.tabular to predict malware detections in Kaggle’s Microsoft Malware Detection competition. My model scores poorly (0.657 AUC, which ranks 1768/2282 and is 0.053 behind the #1 score). The Kaggle forums say that even the most basic blind models are scoring around 0.67, but of course those posters are not fastai users; I think they’re using more standard ML techniques/packages. Does anyone have an opinion on whether the fastai.tabular deep learning model simply doesn’t perform as well in this particular case? If I recall correctly, Jeremy said in his course that he has found that 90% of the time the fastai.tabular library is as good as more conventional structured-data ML, so I wonder if this is just one of the 10% of cases where it isn’t.

I realize I haven’t given a detailed description of how I’m handling the different categorical/continuous variables, or how I’m splitting validation. I can get into this if anyone is interested.


In that competition you get training data from one period of time and test data from a much later period of time. The trick is working out what features will generalize well over long time periods; it’s very hard to learn that from the data because most of the training data comes from a short period of time.
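One way to make local validation resemble that setup is to hold out the most recent rows rather than a random sample. This is only a rough sketch: the column names `observed_at` and `target` and the helper `time_based_split` are made up for illustration, and it assumes you have some column that orders rows in time (or a reasonable proxy for one).

```python
import pandas as pd


def time_based_split(df, time_col, valid_frac=0.2):
    """Hold out the most recent `valid_frac` of rows as the validation set.

    `time_col` is assumed to be sortable (a datetime or a monotonic proxy).
    """
    ordered = df.sort_values(time_col, kind="stable")
    n_valid = int(len(ordered) * valid_frac)
    split = len(ordered) - n_valid
    return ordered.iloc[:split], ordered.iloc[split:]


# Toy data with hypothetical column names, just to show the behaviour.
toy = pd.DataFrame(
    {
        "observed_at": pd.to_datetime(
            ["2018-09-03", "2018-08-01", "2018-09-20", "2018-08-15", "2018-09-10"]
        ),
        "target": [1, 0, 1, 0, 1],
    }
)

train_part, valid_part = time_based_split(toy, "observed_at", valid_frac=0.4)
```

If the validation score from a split like this drops well below the score from a random split, that gap is a hint that the model is leaning on features that don't hold up over time.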

So I’d think to do well in that competition you’ll need to do quite a bit of hand engineering of features (or at the very least import a bunch of external supporting data); I think that’s going to be more important than the learner you use.
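As a starting point for that kind of feature work, here is a rough sketch of one simple check: for each categorical column, measure what share of test rows carry a value that never appears in training. The helper names and the toy columns (`sig_version`, `os_name`) are invented for the example and are not taken from the competition data.

```python
import pandas as pd


def unseen_share(train_col, test_col):
    """Fraction of non-null test rows whose value never appears in train."""
    seen = set(train_col.dropna().unique())
    test_vals = test_col.dropna()
    if len(test_vals) == 0:
        return 0.0
    return float((~test_vals.isin(seen)).mean())


def drift_report(train_df, test_df, columns):
    """Unseen-value share per column, highest (most suspicious) first."""
    shares = {col: unseen_share(train_df[col], test_df[col]) for col in columns}
    return pd.Series(shares).sort_values(ascending=False)


# Toy frames with hypothetical column names.
train_df = pd.DataFrame(
    {
        "sig_version": ["1.0", "1.0", "1.1", "1.2"],
        "os_name": ["win10", "win8", "win10", "win10"],
    }
)
test_df = pd.DataFrame(
    {
        "sig_version": ["1.2", "1.3", "1.4", "1.0"],
        "os_name": ["win10", "win10", "win8", "win10"],
    }
)

report = drift_report(train_df, test_df, ["sig_version", "os_name"])
```

Columns near the top of a report like this are candidates for dropping, bucketing, or replacing with a hand-built feature (for example, a coarser version number), since an embedding learned for values seen in training says nothing about values that only show up later.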
