Unbalanced datasets and their impact on fairness

I recall Jeremy mentioning in a lesson that “unbalanced datasets” are not an issue and do not adversely affect the accuracy of predictions for the minority samples.

Assuming I understood that correctly, I am confused as I read this paper on sources of unfairness in machine learning algorithms which seems to suggest that imbalanced representation in a dataset can result in unfair outcomes (in other words, poor accuracy?) for the minority samples.

The paper states: “Minimizing Average Error Fits Majority Populations: …if we train a group-blind classifier to minimize overall error, if it cannot simultaneously fit both populations optimally, it will fit the majority population. This is because — simply by virtue of their numbers — the fit to the majority population is more important to overall error than the fit to the minority population. This leads to a different (and higher) distribution of errors in the minority population. This effect can be quantified, and can be partially alleviated via concerted data gathering efforts”.

What am I missing? Is it that Jeremy is referring to “unbalanced” in terms of the labels, while the paper is referring to “unbalanced” in terms of the features across labels?


It could be a kind of double-edged sword, I believe. If we have unbalanced target values, we should gather more examples with these specific values of the target. However, if we have different clusters in the data, and some of them have only a few observations, then we probably need to make sure that we collect a wider range of values for the features.
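As a quick sanity check for the first kind of imbalance, one can simply count the target values before training. A minimal sketch (the labels below are made up for illustration):

```python
from collections import Counter

# Hypothetical labels for an imbalanced binary task: 950 vs 50
labels = [0] * 950 + [1] * 50

counts = Counter(labels)
minority_frac = min(counts.values()) / len(labels)

print(counts)         # Counter({0: 950, 1: 50})
print(minority_frac)  # 0.05 -> the minority class is 5% of the data
```

The same kind of counting per cluster (rather than per label) would surface the second problem you describe: clusters with only a handful of observations.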

Also, in general, we could treat any feature of the dataset as a target, right? Like, we just pick k-1 columns and use them to predict the k-th column. So probably the main question here is how to build a diverse dataset that covers a wide range of possible values, so that these “clusters” are big enough to discover meaningful signals.

Or in the case of image data, again, we would probably like to have the classes balanced, as well as a wide range of different pictures within each class.

I think the whole point here is that it is impossible to build big enough clusters for minorities, and/or that the features that are unique to them may be too far off from the majority and impossible to accurately represent using the same loss function. In addition, separating those features out would end up strongly correlating to their minority status and count as being discriminatory.

Let us take some examples:

  1. Minority samples: say, in a model that looks at high school chess clubs over the years and predicts the likelihood of someone being a part of one. In such a dataset the minority will be women. But we just don’t have historical data of women in such clubs that one can add to the dataset. So this dataset will always remain unbalanced with respect to women and might be unfairly biased against women as a result.
  2. Feature uniqueness: consider a Facial Emotion Recognition model trained on faces from across the world. Assume we have sufficient diversity in terms of age, race and ethnicity. But the intensity with which we express emotions is also correlated with the cultural norms around us, our perception of self, our personalities, etc. For example, how big my smile is when I’m happy might depend on what is culturally appropriate for me, how I feel about my teeth/jaw, how often I was criticised for laughing too much, and so on. How would you represent or accurately predict all possible intensities of happiness given these considerations, when using visible, measurable features like the size of the smile, the volume of laughter, or how wide the eyes are open?
  3. Feature correlation to protected attributes: this is a well known issue with respect to fairness in ML, and Rachel has talked about it extensively in this tweet thread, so I won’t repeat it here.

Coming back to unbalanced datasets - it’ll be good to understand what kinds of imbalance matter and which ones don’t. I suspect that if we’re trying to predict across very distinct categories where the features are very obviously far apart, imbalance may not matter (for example, bananas vs strawberries).

I’d love to understand this better… please correct me if I’m confusing tangential aspects! Thanks!


I’m exploring that too. I remember hearing Jeremy say that “NN’s kinda just figure it out”, but in an ML course he says you should oversample. Regardless, I bet he would tell us to try it out and see! I have been observing that when I oversample in my training set, making the underrepresented class more prevalent, my models do a better job at predicting it.
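For what it’s worth, the oversampling you describe can be done with plain random sampling with replacement (in PyTorch, `torch.utils.data.WeightedRandomSampler` achieves the same effect at the DataLoader level). A minimal sketch with a hypothetical 90/10 class split:

```python
import random

random.seed(0)

# Hypothetical imbalanced training set: 90 negatives, 10 positives
data = [(x, 0) for x in range(90)] + [(x, 1) for x in range(10)]

majority = [row for row in data if row[1] == 0]
minority = [row for row in data if row[1] == 1]

# Oversample the minority class with replacement until classes match
oversampled = majority + random.choices(minority, k=len(majority))
random.shuffle(oversampled)

counts = {0: 0, 1: 0}
for _, y in oversampled:
    counts[y] += 1
print(counts)  # {0: 90, 1: 90} -> balanced training set
```

The trade-off is that the minority examples are repeated, so the model sees the same few points many times rather than genuinely new information.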

This is a tough one. I’m working on a credit card fraud dataset where there is about 1 fraud per 580 transactions. Therefore, if we predicted every transaction to be 0, we would be accurate 99.8% of the time and would miss every case of fraud. Sure enough, if I optimize for accuracy, this is what happens. Great accuracy and zeros for days…
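That all-zeros baseline is easy to reproduce; here is a sketch at roughly the 1-in-580 fraud rate mentioned above (the counts are made up to match that ratio):

```python
# Hypothetical fraud labels: 100 frauds among 58,000 transactions (~1 in 580)
n = 580 * 100
y_true = [0] * (n - 100) + [1] * 100

# The "predict everything as legitimate" baseline
y_pred = [0] * n

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
recall = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / sum(y_true)

print(f"accuracy = {accuracy:.4f}")  # 0.9983 -> looks great
print(f"recall   = {recall:.4f}")    # 0.0000 -> misses every single fraud
```

This is exactly why accuracy alone is the wrong metric here: it rewards the degenerate model.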

Kaggle competitions are using AUROC to help with this, but as the papers state, the majority tends to dominate still.
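Since AUROC came up: it measures how well the model ranks positives above negatives, independently of the class proportions, which is why it is less misleading here than accuracy. A minimal pure-Python sketch using the ranking interpretation (the scores below are toy values for illustration):

```python
def auroc(y_true, scores):
    """AUROC as the probability that a randomly chosen positive
    is scored above a randomly chosen negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 4 negatives, 2 positives, one mis-ranked pair
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.9]
print(auroc(y_true, scores))  # 0.875 (7 of 8 pos/neg pairs ranked correctly)
```

Note that AUROC only evaluates ranking; it does not by itself stop the loss from fitting the majority, which is the paper’s point.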

I think all of your examples are spot on. The one thing I would add is that you can probably combine different minorities to help with a group being too small.

With my fraud cases, I don’t think I would split between cyber fraud and elder abuse. The groups would be too small.

So for the chess club example, you probably wouldn’t want to split women and other underrepresented groups in a school.


That’s an interesting insight. I’m learning that a lot more focus needs to be on curating, reviewing and auditing datasets if you wish to solve real world problems using ML!

That’s a fair point. I would say that some of these tasks are pretty complex in nature and therefore challenging to model.

Like, in your example about facial emotion recognition, I think it could be difficult to create a “simple” end-to-end model to predict emotions for every nation and culture. In this case, some “specialization” in models is probably required. (Train several models, one to address each culture.)

A smarter approach could be to introduce this cultural information as tabular features and inject it into the pipeline, so the model has the additional expert knowledge it needs to rescale its predictions properly depending on nationality. For example, we can add embedding layers and attach information about cultural aspects to the input tensors in hidden layers. Something like (in pseudo-code):

x = torch.cat([x, cultural_embeddings], dim=1)
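To make that concrete, here is a small NumPy sketch of the same lookup-and-concatenate step (in PyTorch this would be an `nn.Embedding` followed by `torch.cat(..., dim=1)`); all the sizes and names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

n_cultures, emb_dim, feat_dim, batch = 5, 3, 8, 4

# Hypothetical learned embedding table: one row per culture/region code
culture_table = rng.normal(size=(n_cultures, emb_dim))

x = rng.normal(size=(batch, feat_dim))  # per-example hidden features
culture_ids = np.array([0, 2, 2, 4])    # per-example culture code

# Look up each example's embedding and concatenate onto its features
cultural_embeddings = culture_table[culture_ids]
x = np.concatenate([x, cultural_embeddings], axis=1)
print(x.shape)  # (4, 11): original features plus the cultural embedding
```

The layers after the concatenation can then condition their predictions on the cultural code, which is the “expert knowledge injection” described above.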

Also, maybe some fine-tuned loss functions that give more weight to the underrepresented cases or incorporate this additional expert information. Recall, for example, the SSD image detectors that incorporate both features responsible for the object’s class and the bounding box location.
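On weighting the loss: PyTorch’s `nn.CrossEntropyLoss` accepts a per-class `weight` tensor for exactly this purpose. A minimal pure-Python sketch of the underlying idea, with made-up weights for a two-class case:

```python
import math

def weighted_nll(probs, target, class_weights):
    """Negative log-likelihood where each class's contribution is
    scaled by its weight (up-weighting rare classes)."""
    return -class_weights[target] * math.log(probs[target])

# Hypothetical setup: class 1 is ~10x rarer, so weight it ~10x more
weights = [1.0, 10.0]
p = [0.9, 0.1]  # model's predicted probabilities for one example

print(weighted_nll(p, 0, weights))  # small penalty: confident on the majority
print(weighted_nll(p, 1, weights))  # large penalty: missed the minority class
```

Averaged over a dataset, this makes errors on the minority class as costly to the optimizer as the far more numerous majority errors, directly counteracting the “minimizing average error fits majority populations” effect quoted above.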

I think one of the good points expressed during the course is that sometimes it is not enough to pass the numbers into a network and expect that it will give good (and fair) results; we need to bring human knowledge into the process. I believe this case of “unfair datasets” is a great example of why Machine Learning engineers should work with domain experts and bring additional knowledge into the modeling process to address such issues.
