Creating a good test set - class distribution

Hi everyone,

I want to create a representative and robust test set for benchmarking a text classifier. The target dataset I want to get predictions for is highly imbalanced (e.g. 90% one class, 10% the other). I labelled a subset of the full dataset and created a training set with a balanced class distribution. My question is whether the test set should also be balanced or should keep the original imbalanced class distribution of the full dataset. Are there advantages/disadvantages to either approach?

Please feel free to share your thoughts and/or experience. Any insight is appreciated :slight_smile:

For anyone who’s interested, I did some more reading on the topic and here are some of my takeaways.

If you have a balanced test set (or validation set), it’s a bit simpler to evaluate the model, since metrics like accuracy give a better picture of model performance.

If you have an imbalanced test set (especially a heavily imbalanced one), you should be more careful with evaluation, since accuracy can be misleading. So obviously you would look at other metrics like precision, recall and/or F1 score. The advantage of using a test set that has the same (or at least a similar) distribution as the target dataset is that it gives you a more realistic estimate of how the model would score in production.
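To make the "accuracy can be misleading" point concrete, here is a minimal sketch in plain Python. The labels and the `always_majority` "model" are made up for illustration: it always predicts the majority class and is scored on a 90/10 test set and on a 50/50 one.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` examples that were predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


def always_majority(labels):
    """Hypothetical degenerate model: predicts class 0 for every example."""
    return [0] * len(labels)


# Hypothetical test sets: 0 = majority class, 1 = minority class.
imbalanced_true = [0] * 90 + [1] * 10
balanced_true = [0] * 50 + [1] * 50

acc_imbalanced = accuracy(imbalanced_true, always_majority(imbalanced_true))
acc_balanced = accuracy(balanced_true, always_majority(balanced_true))
minority_recall = recall(imbalanced_true, always_majority(imbalanced_true), positive=1)

print(f"accuracy on 90/10 test set: {acc_imbalanced:.2f}")   # 0.90
print(f"accuracy on 50/50 test set: {acc_balanced:.2f}")     # 0.50
print(f"minority recall on 90/10:   {minority_recall:.2f}")  # 0.00
```

The useless model gets 90% accuracy on the imbalanced set only because of the class ratio, while its minority recall is zero. On the balanced set the same model drops to 50%, which is why accuracy is easier to read there.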

Then there is the not-so-uncommon case of having an imbalanced dataset where you care more about the underrepresented class than the majority class. In this case you need to look at model performance at the individual class level anyway, with special focus on the more important class. And since there is no point in throwing away labelled examples from the majority class (of which you usually have many more) just to rebalance the test set, I wouldn’t do so.
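Looking at each class separately could be sketched like this. The helper `per_class_metrics` and the predictions are hypothetical (a 900/100 test set with invented error counts), not results from a real model.

```python
def per_class_metrics(y_true, y_pred):
    """Return precision, recall, F1 and support for every class in y_true."""
    report = {}
    for cls in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[cls] = {"precision": precision, "recall": recall,
                       "f1": f1, "support": tp + fn}
    return report


# Made-up example: 900 majority (0) and 100 minority (1) test examples.
y_true = [0] * 900 + [1] * 100
# Invented predictions: 45 majority examples flagged as minority,
# 30 minority examples missed.
y_pred = [0] * 855 + [1] * 45 + [1] * 70 + [0] * 30

report = per_class_metrics(y_true, y_pred)
for cls, metrics in report.items():
    print(cls, {k: round(v, 3) for k, v in metrics.items()})
```

In this invented case overall accuracy is 92.5%, yet minority recall is only 0.70 and minority precision is about 0.61, which is the number you would actually care about if class 1 is the important one. If you use scikit-learn, `classification_report` gives the same kind of per-class breakdown.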

So to sum up, I think it’s better to use a test set that resembles your real class distribution. But even though I read comments like “never balance your test set”, as long as you interpret your results carefully I think it doesn’t matter too much in practice.
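One example of what "interpreting carefully" can mean: precision measured on a balanced test set does not carry over to a 90/10 target. Here is a rough sketch, assuming the per-class error rates (true positive rate and false positive rate) stay the same between the test set and production and only the class ratio changes. The rates below are invented for illustration.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Expected precision for the positive class at a given class prevalence.

    Assumes tpr (recall on positives) and fpr (share of negatives wrongly
    flagged as positive) do not change when the class ratio changes.
    """
    true_positives = tpr * prevalence
    false_positives = fpr * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Invented rates, as if measured on a balanced test set.
tpr, fpr = 0.9, 0.1

balanced_precision = precision_at_prevalence(tpr, fpr, prevalence=0.5)
target_precision = precision_at_prevalence(tpr, fpr, prevalence=0.1)

print(f"precision at 50% prevalence: {balanced_precision:.2f}")  # 0.90
print(f"precision at 10% prevalence: {target_precision:.2f}")    # 0.50
```

Recall for each class is unaffected by the class ratio under this assumption, but precision (and therefore F1) for the minority class drops a lot. So if you do evaluate on a balanced test set, this kind of adjustment is worth doing before quoting precision for the real data.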