Gender and Race Bias in a DuckDuckGo-based Dataset

Hi, I built a female/male classifier using the search_images function from Lesson 1, querying DuckDuckGo for “female photos” and “male photos”, and it performs well on the test set (error rate = 0.17). However, the DuckDuckGo search results, and therefore the whole dataset including the test set, are biased. As a result, the classifier made some interesting “mistakes” (a rough sketch of how I built it follows the list below):

  1. When I tested drag queen photos, the one with a beard and drag makeup got female probability = 56%, and the one with half drag-queen face and half male face got a very high female prob = 90% (see 2nd image pair). It seems the model relies on female features rather than male features to determine gender, suggesting how females are treated as the “second” gender, separate from the “default” human - males.
  2. When tested with angry female photos (3rd image pair), the model wrongly classified one of them as male with male prob = 66%, while another angry female face with a gentler expression still got a high female prob = 98%.
  3. When tested with muscular female photos (4th image pair), the model classified one of them as male; their female probs were 72% and 42%, with the female prob decreasing as muscle increases.
  4. When tested with male photos of different races (European and Asian, 5th image pair) in almost the same posture, the model was slightly more confident on the European one (male prob = 97%) than the Asian one (male prob = 89%).
  5. At the same time, the model performed the same on European and Asian female photos (6th image pair) with similar postures.
  6. I tested a male tears/no-tears photo pair (7th image pair, tears removed in Photoshop) to see if non-vulnerability is one of the characteristics the model learned, but surprisingly no: the Photoshopped version without tears actually lowered the male prob from 60% to 49%. I wonder whether the edit contaminated the authenticity of the photo, whether removing the tears helped eye detection, or whether this too comes from the biased dataset (traditionally, males are encouraged to be less emotional)?
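
For reference, here is roughly how the dataset and model were built - a minimal sketch, using fastbook’s search_images_ddg in place of the notebook’s search_images helper; the folder name, image counts, and epoch count are placeholders rather than exactly what I ran:

```python
# Minimal sketch of the dataset build (helper names from fastbook/fastai;
# folder names and counts are illustrative).
from fastbook import search_images_ddg
from fastai.vision.all import *

path = Path('gender_photos')
for label, query in [('female', 'female photos'), ('male', 'male photos')]:
    dest = path/label
    dest.mkdir(parents=True, exist_ok=True)
    download_images(dest, urls=search_images_ddg(query, max_images=150))

# drop any downloaded images that fail to open
failed = verify_images(get_image_files(path))
failed.map(Path.unlink)

dls = ImageDataLoaders.from_folder(
    path, valid_pct=0.2, seed=42, item_tfms=Resize(224))
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(3)
```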

Thanks for the stimulating post. It took me a while to work out how to unpack it.

I think it’s awkward to attribute a “mistake” to images that deliberately mix categories, like drag queens. 50/50 seems reasonable for the one above; I personally can’t tell if it’s a man with makeup or a woman with a beard.

> and the one with half drag-queen face and half male face got a very high female prob = 90% (see 2nd image pair). It seems the model relies on female features rather than male features to determine gender

> suggesting how females are treated as the “second” gender, separate from the “default” human - males.

While this is an unfortunate reality in some parts of the world, it seems a stretch to attribute such a humanistic meme to this model. In a broader sense, it makes me wonder how hard it is for any of us to guard against our own biased interpretation of results, and how we push ourselves to dig deeper to verify results that conveniently support a point we’d like to make. Here is one insightful TED Talk about someone coming to terms with their bias.

What I found interesting was exploring which features the model depends on for its inference. This can be done by masking out which part of the image gets processed. Comparing the next two snapshots, it’s amazing what impact such a small slice of lipstick has.
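A rough sketch of that masking experiment, assuming a trained fastai learner `learn` like the one described earlier in the thread (helper name and coordinates are my own, purely illustrative):

```python
# Black out everything outside a chosen box and re-run the prediction.
import numpy as np
from PIL import Image
from fastai.vision.all import PILImage

def predict_region(learn, img_path, box):
    "Keep only `box` = (left, top, right, bottom) visible; mask the rest in black."
    img = Image.open(img_path).convert('RGB')
    masked = Image.new('RGB', img.size, (0, 0, 0))
    masked.paste(img.crop(box), box[:2])
    return learn.predict(PILImage.create(np.array(masked)))

# e.g. a narrow strip around the lips (coordinates purely illustrative)
# predict_region(learn, 'drag_queen.jpg', (60, 180, 200, 230))
```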

Classification based on brightly coloured lips seems simple, and absent any complex humanistic meme.

Perhaps the model should be taking more notice of the stubble, except it mightn’t have needed to learn that feature when overtly coloured lips were sufficient. It might be interesting to desaturate the drag queen photos and see how the inference changes (a quick sketch of that below), or to train a model on “women without makeup” and see how the inference for the drag queen changes.
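One way to try the desaturation idea - again assuming the `learn` object from above, with a hypothetical file name:

```python
# Drop the colour information but keep three channels, then re-run inference.
import numpy as np
from PIL import Image, ImageOps
from fastai.vision.all import PILImage

img = Image.open('drag_queen.jpg').convert('RGB')
grey = ImageOps.grayscale(img).convert('RGB')
print(learn.predict(PILImage.create(np.array(grey))))
```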

The other feature I found really interesting was the two-thirds chance of female from considering only the male’s eye. Note this was a very narrow band that I stumbled on; including any more of the man’s lips or his neck and shoulders reverted to a strong male prediction.

Comparing the next few images, keeping the hair colour out of play, it seems the other female indicators are:

large eyelashes and rosy cheeks:

not the large earring:

fairly decisive hair colouring:

long fingernails are the kicker:

> When tested with muscular female photos (4th image pair), the model classified one of them as male; their female probs were 72% and 42%, with the female prob decreasing as muscle increases.

That seems a reasonable reflection of reality. But wow! Who’d have thought leaving out the left elbow would make such a difference.


Wow, thanks for all the interesting testing! I also noticed the model predicting from the human body in unexpected ways: when I cropped off half of the red hair in the drag queen photo, the female prob increased from 90% to 97%.

This is definitely worth trying! However, it’s quite hard to get what I want by just searching “female without makeup photos”, as shown below:

So I wonder: instead of testing a black box like this, is there any way to see how the model works? I know there is a way to visualize different layers’ convolutional kernels, but I don’t know whether that would give any interpretable visualization in the deeper layers, which is where the classifier really did the transfer learning from ImageNet, right?
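One thing I might try, along the lines of the fastbook chapter on CNN interpretation, is a class activation map (CAM) rather than raw kernel visualizations. A rough sketch of what I mean, assuming my `learn` object from above and an example file name:

```python
# Coarse class activation map: weight the CNN body's feature maps by the
# final linear layer's weights to see which image regions drive each class.
import torch
from fastai.vision.all import first

dl = learn.dls.test_dl(['drag_queen.jpg'])   # preprocess one image
x, = first(dl)

learn.model.eval()
with torch.no_grad():
    acts = learn.model[0](x)          # body feature maps, e.g. (1, 512, 7, 7)
    w = learn.model[1][-1].weight     # final linear layer, (n_classes, 512)
    cam = torch.einsum('ck,bkij->bcij', w, acts)

# cam[0, i] is a coarse heat map of which regions pushed the prediction
# towards class i; upsample it and overlay it on the photo to inspect.
```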
