Porto Seguro’s Safe Driver Prediction -- Dealing with Unbalanced Data

Since the data for the “Porto Seguro’s Safe Driver Prediction” competition is highly imbalanced (only 3.6% of the training labels are 1’s; the rest are 0’s), here are a few starter links for dealing with that. For now, I have tried downsampling and tuned a random forest model.
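In case it helps anyone, here is a minimal sketch of the downsampling idea, on a toy dataset standing in for the real one (the column names and the ~3.6% positive rate are just assumptions for illustration): randomly drop rows from the majority class until the two classes are the same size.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for the competition data: ~3.6% positives, like the real target.
df = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "target": (rng.random(10_000) < 0.036).astype(int),
})

pos = df[df["target"] == 1]
neg = df[df["target"] == 0]

# Randomly downsample the majority (0) class to match the minority count,
# then shuffle so the classes are interleaved.
neg_down = neg.sample(n=len(pos), random_state=0)
balanced = pd.concat([pos, neg_down]).sample(frac=1, random_state=0)

print(balanced["target"].value_counts())
```

One thing to keep in mind: downsampling throws away a lot of the 0’s, so the model sees fewer examples overall; training several models on different downsampled subsets and averaging them is one common workaround.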

Other things that can be tried -
– SMOTE for imbalanced data
– XGBoost for modeling
– Reading the discussion forum on Kaggle. Some people with domain expertise are coming up with great insights about the features.
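For anyone curious what SMOTE actually does: rather than duplicating minority rows, it creates synthetic ones by interpolating between a minority sample and one of its nearest minority neighbors. In practice you would use the `SMOTE` class from the imbalanced-learn package; the sketch below just illustrates the core idea with numpy and sklearn (all names here are my own, not from any library).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples, SMOTE-style:
    each new point lies on the segment between a random minority
    sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Neighbor indices; column 0 is the point itself, so skip it below.
    _, idx = nn.kneighbors(X_min)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]  # a random true neighbor of i
        gap = rng.random()                  # interpolation factor in [0, 1)
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)

X_min = np.random.default_rng(1).normal(size=(20, 4))  # toy minority class
synthetic = smote_sketch(X_min, n_new=30)
print(synthetic.shape)  # (30, 4)
```

Note that SMOTE assumes interpolating between samples gives plausible points, which is questionable for the anonymized categorical features in this competition, so it is worth validating carefully.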

Please add any other suggestions related to this competition.

Links –

https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html

PS: here is the kernel I started for this competition. The results are quite poor for now, but I am working on improving them.
https://www.kaggle.com/grroverpr/tuning-random-forest-downsampling-transformation/

5 Likes

The simplest approach is to add the parameter class_weight="balanced" to the RF constructor. It may not be the best, but it’s likely a good start. Note that you may need to “undo” this in your actual predictions - it depends on what the eval metric in the comp is.
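Concretely, that looks like the sketch below (the toy dataset and hyperparameters are just placeholders, not a recommendation): class_weight="balanced" reweights each class inversely to its frequency, so the rare 1’s count as much as the common 0’s during fitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced dataset standing in for the competition data (~3.6% positives).
X, y = make_classification(n_samples=5_000, weights=[0.964], random_state=0)

# class_weight="balanced" scales sample weights by n_samples / (n_classes * count),
# so errors on the minority class cost proportionally more during training.
rf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", n_jobs=-1, random_state=0
)
rf.fit(X, y)
print(rf.predict_proba(X[:5])[:, 1])
```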

1 Like

I tried that, but fitting the model with class_weight = ‘balanced’ was taking a lot of time.
Sorry, I don’t understand what you mean by “ ‘undo’ this in your actual predictions”. Using class_weight = ‘balanced’ gives a model we can use to predict on unseen observations. What should be done after that?

We’ll learn about how to speed things up on Tuesday.

We’ll talk about handling unbalanced data in class too - answering your 2nd question probably needs that background…

1 Like