Porto Seguro’s Safe Driver Prediction -- Dealing with Unbalanced Data

Since the data for the “Porto Seguro’s Safe Driver Prediction” competition is highly imbalanced (only 3.6% of the training labels are 1’s; the rest are 0’s), here are a few starter links for dealing with that. For now, I have tried downsampling and tuned a random forest model.
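In case it helps anyone, here is a minimal sketch of the downsampling idea, on a toy dataset standing in for the real one (the column names and the ~3.6% positive rate are just assumptions for illustration): randomly drop rows from the majority class until the two classes are the same size.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for the competition data: ~3.6% positives, like the real target.
df = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "target": (rng.random(10_000) < 0.036).astype(int),
})

pos = df[df["target"] == 1]
neg = df[df["target"] == 0]

# Randomly downsample the majority (0) class to match the minority count,
# then shuffle so the classes are interleaved.
neg_down = neg.sample(n=len(pos), random_state=0)
balanced = pd.concat([pos, neg_down]).sample(frac=1, random_state=0)

print(balanced["target"].value_counts())
```

One thing to keep in mind: downsampling throws away a lot of the 0’s, so the model sees fewer examples overall; training several models on different downsampled subsets and averaging them is one common workaround.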

Other things that can be tried -
– SMOTE for imbalanced data
– XGBoost for modeling
– Reading the discussion forum on Kaggle. Some people with domain expertise are coming up with great insights about the features.
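For anyone curious what SMOTE actually does: rather than duplicating minority rows, it creates synthetic ones by interpolating between a minority sample and one of its nearest minority neighbors. In practice you would use the `SMOTE` class from the imbalanced-learn package; the sketch below just illustrates the core idea with numpy and sklearn (all names here are my own, not from any library).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples, SMOTE-style:
    each new point lies on the segment between a random minority
    sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Neighbor indices; column 0 is the point itself, so skip it below.
    _, idx = nn.kneighbors(X_min)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]  # a random true neighbor of i
        gap = rng.random()                  # interpolation factor in [0, 1)
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)

X_min = np.random.default_rng(1).normal(size=(20, 4))  # toy minority class
synthetic = smote_sketch(X_min, n_new=30)
print(synthetic.shape)  # (30, 4)
```

Note that SMOTE assumes interpolating between samples gives plausible points, which is questionable for the anonymized categorical features in this competition, so it is worth validating carefully.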

Please add any other suggestions related to this competition.

Links –

https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html

PS: here is the kernel I started for this competition. The results are quite poor for now, but I am working on improving them.
https://www.kaggle.com/grroverpr/tuning-random-forest-downsampling-transformation/

5 Likes

The simplest approach is to add the parameter class_weight="balanced" to the RF constructor. It may not be the best, but it’s likely a good start. Note that you may need to “undo” this in your actual predictions - it depends on what the eval metric in the comp is.
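Concretely, that looks like the sketch below (the toy dataset and hyperparameters are just placeholders, not a recommendation): class_weight="balanced" reweights each class inversely to its frequency, so the rare 1’s count as much as the common 0’s during fitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced dataset standing in for the competition data (~3.6% positives).
X, y = make_classification(n_samples=5_000, weights=[0.964], random_state=0)

# class_weight="balanced" scales sample weights by n_samples / (n_classes * count),
# so errors on the minority class cost proportionally more during training.
rf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", n_jobs=-1, random_state=0
)
rf.fit(X, y)
print(rf.predict_proba(X[:5])[:, 1])
```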

1 Like

I tried that, but fitting the model with class_weight = ‘balanced’ was taking a lot of time.
Sorry, I don’t understand what you mean by “ ‘undo’ this in your actual predictions”. Using class_weight = ‘balanced’ gives a model we can use to predict on unseen observations. What should be done after that?

We’ll learn about how to speed things up on Tuesday.

We’ll talk about handling unbalanced data in class too - answering your 2nd question probably needs that background…

1 Like