Binary classification model for unbalanced datasets

brightsparc · December 13, 2017, 4:22am

I have a language model that generates fragments which make sense.

I’m using its encoder to train a binary classification model based on the IMDB example, but only 5% of my dataset belongs to the positive class. Do I need to re-balance the dataset, or make any adjustments to the hyperparameters or the loss function for this situation?

Thanks,
Julian.

marcemile · December 13, 2017, 10:48am

The quick answer is to scale the loss depending on which class you are predicting: the weight for each class should be inversely proportional to the amount of data for that class.

Have a look at http://arxiv.org/abs/1710.05381 for more details.
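
To make the weighting idea concrete, here is a minimal sketch in plain Python, assuming a 5% positive rate like the one in the question. The helper names (inverse_frequency_weights, weighted_bce) and the "n_samples / (n_classes * class_count)" normalisation are my own choices for illustration, not something taken from the IMDB example or the paper.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class, inversely proportional to its count.

    Assumed normalisation: n_samples / (n_classes * class_count), so a
    perfectly balanced dataset gives every class a weight of 1.0.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def weighted_bce(probs, labels, weights, eps=1e-7):
    """Weighted mean binary cross-entropy; probs are predicted P(y = 1)."""
    total, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += weights[y] * loss
        norm += weights[y]
    return total / norm


# Toy labels with 5% positives (made up for illustration).
labels = [1] * 5 + [0] * 95
weights = inverse_frequency_weights(labels)

# A model that always predicts the base rate looks fine under the plain
# loss but is penalised much more once the rare class is up-weighted.
probs = [0.05] * len(labels)
plain = weighted_bce(probs, labels, {0: 1.0, 1: 1.0})
scaled = weighted_bce(probs, labels, weights)
```

With these weights each positive example counts 19 times as much as a negative one, which is the negative-to-positive ratio. In PyTorch the same ratio could be passed as the pos_weight argument of torch.nn.BCEWithLogitsLoss instead of writing the loss by hand.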

saurabhjha21 · December 13, 2017, 11:32am

I have come across a fraud detection problem where non-fraudulent transactions make up 99% of the data and the remaining 1% are fraudulent. Have you come across any paper that handles this issue? If so, please share.

marcemile · December 13, 2017, 12:54pm

If you have such a strong imbalance, an anomaly detection algorithm is probably better suited than deep learning. See https://en.wikipedia.org/wiki/Anomaly_detection#Popular_techniques
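
As a rough illustration of what anomaly detection means here, the sketch below models only the "normal" class and flags anything that falls far outside it. It uses a simple z-score rule on a single feature, which is just one of many possible techniques. The transaction amounts, the threshold of 3 standard deviations and the function names are all made up for the example.

```python
import statistics


def fit_normal_profile(values):
    """Summarise the majority (non-fraud) class by its mean and std dev."""
    return statistics.mean(values), statistics.stdev(values)


def is_anomaly(x, mean, std, threshold=3.0):
    """Flag x if it lies more than `threshold` std devs from the mean."""
    return abs(x - mean) / std > threshold


# Hypothetical amounts from transactions known to be legitimate.
normal_amounts = [20.0, 22.5, 19.0, 21.0, 23.0, 18.5, 20.5, 21.5]
mean, std = fit_normal_profile(normal_amounts)

# Score new transactions: one typical, one far outside the usual range.
flags = [is_anomaly(x, mean, std) for x in [21.0, 250.0]]
```

The point is that no fraudulent examples are needed to fit the model, which is why this family of methods copes with a 99% / 1% split. Real fraud data has many features, so in practice you would reach for one of the multivariate techniques listed on that page.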