Advice on Microsoft Malware Detection Kaggle competition

dfarrell · March 8, 2019, 2:33am

I’m trying to use fastai.tabular to predict malware detections in Kaggle’s Microsoft Malware Detection competition. My model scores poorly: 0.657 AUC, which ranks 1768/2282 and is 0.053 behind the #1 score. The Kaggle forums say that even the most basic blind models score around 0.67, but of course those aren’t fast.ai users; I think they’re using more standard ML techniques and packages. Does anyone have an opinion on whether the fastai.tabular deep learning model simply doesn’t perform as well in this particular case? If I recall correctly, Jeremy said in his course that he has found that 90% of the time the fastai.tabular library is as good as more conventional structured-data ML, so I wonder if this is just one of the 10% of cases where it won’t work.

I realize I haven’t given a detailed description of how I’m handling the different categorical/continuous variables, or how I’m splitting validation. I can get into this if anyone is interested.

edwardjross · March 8, 2019, 9:15am

In that competition you get training data from one period of time and test data from a much later period of time. The trick is working out what features will generalize well over long time periods; it’s very hard to learn that from the data because most of the training data comes from a short period of time.
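One way to make validation reflect that train/test time gap is to hold out the most recent slice of the training data instead of a random sample. Here is a minimal sketch of that idea in plain Python. The `seen_on` field and the `time_ordered_split` helper are hypothetical names for illustration; the competition data may not expose a usable timestamp directly, so you might have to derive one first.

```python
from datetime import date

def time_ordered_split(rows, time_key, holdout_frac=0.2):
    """Split rows so that validation holds only the most recent records.

    `time_key` is a hypothetical field name; assume each row carries
    some sortable timestamp under that key.
    """
    if not 0 < holdout_frac < 1:
        raise ValueError("holdout_frac must be strictly between 0 and 1")
    ordered = sorted(rows, key=lambda r: r[time_key])
    # Keep at least one row in the validation slice.
    n_valid = max(1, round(len(ordered) * holdout_frac))
    cut = len(ordered) - n_valid
    return ordered[:cut], ordered[cut:]

# Toy rows with made-up dates and labels, purely for illustration.
rows = [
    {"seen_on": date(2018, 9, 20), "label": 1},
    {"seen_on": date(2018, 8, 1), "label": 0},
    {"seen_on": date(2018, 10, 5), "label": 0},
    {"seen_on": date(2018, 8, 15), "label": 1},
]

train, valid = time_ordered_split(rows, "seen_on", holdout_frac=0.25)
```

A validation score measured this way is usually a more honest guide to leaderboard performance than a random split, although it can still be optimistic if the test period is much later than anything in the training data.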

So I’d think that to do well in that competition, you’ll need to do quite a bit of hand engineering of features (or at the very least import a bunch of external supporting data); I think that’s going to be more important than the learner you use.
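As a rough starting point for deciding which features are worth hand engineering, you could check how many values of each categorical feature in a later period were already seen in an earlier one. This is only a sketch under assumptions: the `os_build` and `av_version` field names and their values are made up, and `seen_value_coverage` is a hypothetical helper, not part of fastai or any library.

```python
def seen_value_coverage(earlier_rows, later_rows, feature):
    """Fraction of later rows whose `feature` value also occurs in earlier rows."""
    if not later_rows:
        raise ValueError("later_rows must not be empty")
    seen = {r[feature] for r in earlier_rows}
    hits = sum(1 for r in later_rows if r[feature] in seen)
    return hits / len(later_rows)

# Hypothetical field names and values, purely for illustration.
earlier = [
    {"os_build": "17134", "av_version": "1.273"},
    {"os_build": "16299", "av_version": "1.275"},
]
later = [
    {"os_build": "17134", "av_version": "1.281"},
    {"os_build": "17134", "av_version": "1.283"},
]

coverage = {
    feature: seen_value_coverage(earlier, later, feature)
    for feature in ("os_build", "av_version")
}
```

In this toy data, `os_build` is fully covered while `av_version` is not covered at all. A feature with low coverage keeps producing values the model never saw, so it probably needs to be re-encoded (for example as an age or an ordinal derived from external data) rather than fed in as a raw category.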