I am currently working on a binary image classifier. The model's goal is to identify any images where Donald Trump is present, so I decided the best route was to have two categories: non-trump and trump images.
The non-trump training data is just a set of random images. I added some extra images of people's faces (non-trump, obviously) so the model wouldn't just learn to detect the difference between images with and without people/faces. The trump images are just a bunch of images with Donald Trump present.
I trained the model and it did pretty well (error rate = 0.05), but because it's a binary classification model I feel like it can do better (< 0.01). When I look at interp.plot_top_losses, it seems like the model struggles the most when a non-trump image has a face in it (see image below… sorry Obama).
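For anyone unfamiliar with what plot_top_losses surfaces: it sorts validation examples by their loss, so confidently-wrong predictions float to the top. Here is a minimal pure-Python sketch of that idea, with made-up filenames and probabilities just for illustration:

```python
import math

def cross_entropy(prob_true_class):
    """Per-example cross-entropy loss from the probability the model
    assigned to the correct class."""
    return -math.log(prob_true_class)

def top_losses(examples, k=3):
    """Sort (filename, prob_of_true_class) pairs by loss, worst first --
    a rough analogue of what interp.plot_top_losses surfaces."""
    scored = [(name, cross_entropy(p)) for name, p in examples]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

# Hypothetical validation results: (image, probability assigned to true label)
preds = [
    ("obama_portrait.jpg", 0.10),   # confident but wrong -> large loss
    ("random_landscape.jpg", 0.99),
    ("trump_rally.jpg", 0.95),
    ("man_in_suit.jpg", 0.30),
]
for name, loss in top_losses(preds):
    print(f"{name}: loss={loss:.2f}")
```

The confidently-wrong Obama portrait lands at the top of the list, which is exactly the kind of example you want to inspect.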
Hey, this type of problem is actually called one-class classification. It's a relatively difficult problem because the non-trump class can be practically anyone. I remember there are a few threads about it on the forum, though I don't recall seeing any implementations. Check them out. Cheers
Based on the findings from the article, I decided to add more training data that was 'similar' to the "one class". The easiest way to find candidates was the plot_top_losses function, which showed that the model was classifying many images of faces as Trump.
I added about 100 random face images to the non-trump class and retrained my model (the non-trump data set was then about 50% random images and 50% faces). This didn't do the trick, so I took another look at what the model was struggling with using plot_top_losses: it was not just faces, but faces of men wearing suits (it particularly failed on red-headed men in suits, which cracks me up).
So I dumped a bunch of images of men in suits into the training data, which improved the model greatly, taking it from a 6% error rate to a 2% error rate. Scrolling through my photos of Donald Trump, I noticed he had a suit on in 95% of them, which explains why the model was classifying men in suits as trump images.
With that knowledge I decided to beef up the trump class with photos of Donald without his suit on. Surprisingly, not many of those exist unless he is golfing, so now I'm worried it might classify golfers as well, but we will see!
The more I think about this type of one-class classification, the more I realize it is difficult to predict exactly what the model will confuse with the "one class". In production, things may come up that weren't included in my validation set (maybe people with bad spray tans). If I do deploy this, it would need lots of monitoring to see what it classifies as trump, and it would be good to periodically retrain the model with new, confusing examples added to the training data.
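One lightweight way to set up that kind of monitoring is to route every "trump" prediction (and any low-confidence prediction) into a review queue, which then doubles as a source of new training examples. A minimal sketch, where the label names and the 0.8 threshold are my own assumptions:

```python
def triage(prediction, review_threshold=0.8):
    """Route a production prediction for a one-class-style classifier.

    prediction: (label, confidence) tuple from the model.
    Confident 'non-trump' predictions pass through; everything predicted
    'trump', or anything low-confidence, is flagged for human review so
    confusing examples can be folded back into the training set.
    The 0.8 threshold is an arbitrary illustration, not a recommendation.
    """
    label, confidence = prediction
    if label == "trump" or confidence < review_threshold:
        return "review"   # collect these as candidate training data
    return "accept"

print(triage(("trump", 0.99)))      # review -- every positive gets checked
print(triage(("non-trump", 0.95)))  # accept
print(triage(("non-trump", 0.60)))  # review -- model is unsure
```

Periodically, the "review" pile gets labeled by a human and merged into the next training run.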
Sorry if this is a bit long-winded; I found the whole process pretty interesting and wanted to provide as much info as possible for anyone who stumbles upon this in the future.
Are you sure your model is not falling into something like the Base Rate Fallacy?
In particular, it happens when you have too many examples of one class and too few examples of another.
It is particularly easy to fall into that fallacy here, as there are far more non-trump images than Trump images to choose from. So that could be one possible reason your model is behaving this way.
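To make the base-rate point concrete, here is a toy calculation with made-up numbers: with a 99:1 class split, a model that never predicts "trump" at all still scores 99% accuracy while being useless on the class we actually care about.

```python
# Toy illustration of the base-rate problem with a 99:1 class split.
n_non_trump, n_trump = 990, 10

# A degenerate model that always predicts "non-trump" gets every
# non-trump image right and every trump image wrong.
always_negative_correct = n_non_trump
accuracy = always_negative_correct / (n_non_trump + n_trump)
recall_on_trump = 0 / n_trump

print(accuracy)         # 0.99 -- looks great
print(recall_on_trump)  # 0.0  -- but it never finds Trump
```

This is why a low error rate on an imbalanced validation set can hide exactly the failure mode you saw in plot_top_losses.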
Train the model with images containing faces only.
Develop a model that can locate the face in an image and make the prediction based on that. It might improve the accuracy. The OpenCV library will help you do the face localization. Just give it a shot 🤞
I have run into this problem sometimes as well.
As I said, it is much easier to find pictures of non-Trump people than of Trump. So if your data is skewed towards one class, the model might find it more difficult to generalize to the other.
So, having an equal amount of data for both classes might help a little.
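If collecting more Trump images isn't practical, one cheap way to equalize the classes is naive oversampling of the smaller one. A minimal sketch with made-up filenames (fastai and PyTorch also offer weighted sampling, which is usually the cleaner route):

```python
import random

def oversample_to_balance(minority, majority, seed=0):
    """Repeat minority-class items (sampled with replacement) until both
    classes are the same size. A crude but simple way to fix a skewed
    training set when you can't collect more real examples."""
    rng = random.Random(seed)
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

# Hypothetical skewed dataset: 100 trump images vs 1000 non-trump images.
trump = [f"trump_{i}.jpg" for i in range(100)]
non_trump = [f"other_{i}.jpg" for i in range(1000)]

balanced_trump, balanced_non_trump = oversample_to_balance(trump, non_trump)
print(len(balanced_trump), len(balanced_non_trump))  # 1000 1000
```

The duplicates add no new information, so pairing this with augmentation (flips, crops, lighting changes) helps avoid memorizing the repeated images.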
The other part was addressed by @pravinvignesh.
You might also have better success by adding non-Trump training images with suits, and making sure you have enough examples of blond men in suits. But I guess you have already figured that part out.