Improving accuracy for unbalanced datasets

Hey guys, I'm having trouble dealing with unbalanced datasets on Kaggle, especially when I try random forests. What are the key things I should do to improve my ROC score?

What type of data in particular? Can you do any data augmentation? Something extremely crude is to add the rarer instances to the dataset multiple times, as in the sketch below. Using Extra Trees instead of a random forest might also increase the ROC, since its splits are chosen more randomly rather than fully optimized.
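If it helps, here is a minimal sketch of that duplication trick (the DataFrame, the 'target' column name, and the 3x duplication factor are all made up for illustration), with Extra Trees as the drop-in model swap:

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Toy stand-in for your training table; 'target' is a hypothetical label column
df = pd.DataFrame({'feature': range(10),
                   'target':  [0] * 8 + [1] * 2})   # 80/20 imbalance

# Crude oversampling: append the rare rows a few extra times, then shuffle
minority = df[df['target'] == 1]
df_over = pd.concat([df] + [minority] * 3, ignore_index=True)
df_over = df_over.sample(frac=1, random_state=0).reset_index(drop=True)

# ExtraTreesClassifier is a drop-in replacement for RandomForestClassifier
clf = ExtraTreesClassifier(n_estimators=200, random_state=0)
clf.fit(df_over[['feature']], df_over['target'])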

Structured data; to be exact, I'm working on Home Credit Default Risk on Kaggle. I'll make sure I try adding the rarer instances. How much impact does it make, given that you're adding data that is already present?

Hi!

Random Forest is a great starter for any ML task. In the case of unbalanced datasets, it is very easy to tweak it so that the bootstrapping part of the algorithm samples one of the classes much more often than the others. Essentially, this oversamples the low-frequency class so that the algorithm learns as much as possible from every observation available there.

Using sklearn you would say:

RandomForestClassifier(..., class_weight='balanced')
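As a fuller sketch, if it helps (the synthetic data here is just a stand-in for your own features and labels), you can compare the unweighted and weighted variants on a held-out split:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 95/5 imbalanced data as a stand-in for the real features
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

# 'balanced_subsample' recomputes the class weights inside every bootstrap sample
for cw in (None, 'balanced', 'balanced_subsample'):
    clf = RandomForestClassifier(n_estimators=200, class_weight=cw, random_state=0)
    clf.fit(X_tr, y_tr)
    print(cw, roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1]))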

To your question about the impact it can have, it really depends on the situation. One way to gauge whether it will help is to dissect your predictions and check your precision/recall, as sketched below. If they are really bad for the minority class, then it is definitely worth trying.
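For that dissection, reusing clf and the validation split from the sketch above, sklearn's classification_report prints per-class precision/recall:

from sklearn.metrics import classification_report

# Per-class precision/recall; watch the row for the rare class
print(classification_report(y_va, clf.predict(X_va)))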

If you want to use NNs in the same Kaggle competition, check out my Kaggle kernel :wink: : https://www.kaggle.com/davidsalazarv95/fast-ai-pytorch-starter-version-two

I tried using class_weight='balanced', but for some reason my ROC score got worse.

But I'll definitely try looking into your kernel, and maybe I'll try something similar.