One of the training images from the Kaggle Dogs vs. Cats Redux competition, dog.5604.jpg, is labeled “dog”, but the image is actually just the text “camera shy” (i.e. not a dog).
This is what dog.5604.jpg looks like:
I wasn’t expecting the Kaggle training dataset to contain abnormal observations like this. Does Kaggle expect us to manually check for and remove incorrectly labeled images?
It might be an anti-cheating measure, or a poops-n-giggles mechanism. Data cleansing is a huge part of being an analyst / hacker, so being able to identify why one’s algo assigns a huge loss to certain outlier samples is something we should all be comfortable doing. If you have a huge dataset, manually validating it is not an option, so you’d have to use an automated process like the one described above. In the sea lions competition and the cervical cancer competition, you could do other processing such as color histograms, or comparing image sizes and depths.
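To make the loss-based check concrete, here’s a minimal sketch of flagging suspect training samples by their per-sample loss. The function name and z-score threshold are my own choices, not from any poster’s code; the idea is just that a mislabeled image tends to sit several standard deviations above the mean loss.

```python
import numpy as np

def flag_suspect_samples(losses, z_thresh=3.0):
    """Return indices of samples whose loss is an outlier,
    i.e. more than z_thresh standard deviations above the mean."""
    losses = np.asarray(losses, dtype=float)
    z = (losses - losses.mean()) / losses.std()
    return np.where(z > z_thresh)[0]

# Toy example: one sample (index 4) with an extreme loss stands out.
losses = [0.1, 0.2, 0.15, 0.12, 9.5, 0.18, 0.11, 0.14]
print(flag_suspect_samples(losses, z_thresh=2.0))  # → [4]
```

You’d then pull up the images at the flagged indices and eyeball them, which is exactly how a “camera shy” text image would surface.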
This is one of the few things that separates the plug-n-play competitors from the top LB scorers: how much sweat and blood they put into fine-tuning their solutions.
Believe it or not, it was “pure luck”: while performing the validation step (displaying some sample images per cell of the confusion matrix), the camera-shy image popped up. At first I thought it was a bug in my code, then realised it was the actual data! Glad to know I’m not the only one, then! (As @haresenpai suggested, it might well be a Kaggle “feature” / anti-cheat measure.) Thanks for checking your data!
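For anyone curious, the per-cell inspection described above can be sketched roughly like this (the helper name and the cat/dog encoding are my own, hypothetical choices; in practice you’d then plot the images at the returned indices):

```python
from collections import defaultdict

def samples_per_cell(y_true, y_pred, k=3):
    """Group sample indices by confusion-matrix cell (true, predicted),
    keeping at most k indices per cell for visual inspection."""
    cells = defaultdict(list)
    for i, (t, p) in enumerate(zip(y_true, y_pred)):
        if len(cells[(t, p)]) < k:
            cells[(t, p)].append(i)
    return dict(cells)

# 0 = cat, 1 = dog. Indices in cell (1, 0) are "dogs" the model called
# cats — prime suspects for bad labels like dog.5604.jpg.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0]
print(samples_per_cell(y_true, y_pred))
```

The off-diagonal cells are where label errors tend to hide, which is why displaying a few images from each cell surfaces them.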
Interesting observation. It is hard to say whether it is on purpose or just an error. Maybe they wanted to promote this useful algorithm: https://arxiv.org/abs/1412.6596
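For context, that paper proposes a “bootstrapping” loss for training on noisy labels: the target blends the given (possibly wrong) label with the model’s own prediction. A minimal numpy sketch of my reading of the soft variant, where beta is the blending weight (function name and example values are mine):

```python
import numpy as np

def soft_bootstrap_loss(q, t, beta=0.95):
    """Soft bootstrapping loss (per sample): cross-entropy against a
    target that mixes the noisy one-hot label t with the model's own
    predicted distribution q, both of shape (n_samples, n_classes)."""
    q = np.clip(q, 1e-12, 1.0)            # avoid log(0)
    target = beta * t + (1.0 - beta) * q  # blended target
    return -np.sum(target * np.log(q), axis=1)

# A sample labeled "dog" that the model is sure is a cat gets a smaller
# penalty than under plain cross-entropy, softening the bad label's pull.
q = np.array([[0.99, 0.01]])   # model: almost certainly class 0
t = np.array([[0.0, 1.0]])     # noisy label: class 1
print(soft_bootstrap_loss(q, t))         # bootstrapped loss
print(-np.sum(t * np.log(q), axis=1))    # plain cross-entropy (larger)
```

That property is exactly what makes this family of losses attractive when a dataset is known to contain mislabeled samples like this one.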