Binary classification model for unbalanced datasets

I’ve a language model which is generate fragments which make sense.

I’m using this encoder to train a binary classification model based on the IMDB example, but my dataset has 5% positive classes. Do I need to re-balance this dataset, or make any adjustments to hyper parameters or loss function for this situation.


1 Like

The quick answer is scale the loss depending on which class your are predicting (it should be inversely proportional to the amount of data for that class).

Have a look at for more details

I have come across a fraud problem, where non fraudulent transactions consists of 99% data, and remaining 1% is fraudulent transactions. Have you come across any paper which handles this issue, please share.

If you have such a strong imbalance, an anomaly detection algorithm is probably better suited than deep learning.