One of the training images from the Kaggle Dogs vs. Cats Redux competition, dog.5604.jpg, is labeled “dog”, but the image is actually just the text “camera shy” (i.e. not a dog).
This is what dog.5604.jpg looks like:
I wasn’t expecting the Kaggle training dataset to contain abnormal observations like this. Does Kaggle expect us to manually check for and remove incorrectly labeled images?
It might be an anti-cheating measure, or a poops-n-giggles mechanism. Data cleansing is a huge part of being an analyst / hacker, so being able to identify why one’s algo assigns a huge loss to certain outlier samples is something we should all be comfortable doing. If you have a huge dataset, manually validating it is not an option, so you’d have to use an automated process like the one described above. In the sea lions competition and the cervical cancer competition, you could do other processing such as color histograms, or comparing image sizes and depths.
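To make the loss-based check concrete, here’s a minimal sketch of flagging suspect training samples by their per-sample loss. The function name and z-score threshold are my own choices, not from any poster’s code; the idea is just that a mislabeled image tends to sit several standard deviations above the mean loss.

```python
import numpy as np

def flag_suspect_samples(losses, z_thresh=3.0):
    """Return indices of samples whose loss is an outlier,
    i.e. more than z_thresh standard deviations above the mean."""
    losses = np.asarray(losses, dtype=float)
    z = (losses - losses.mean()) / losses.std()
    return np.where(z > z_thresh)[0]

# Toy example: one sample (index 4) with an extreme loss stands out.
losses = [0.1, 0.2, 0.15, 0.12, 9.5, 0.18, 0.11, 0.14]
print(flag_suspect_samples(losses, z_thresh=2.0))  # → [4]
```

You’d then pull up the images at the flagged indices and eyeball them, which is exactly how a “camera shy” text image would surface.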
This is one of the few things that separates the plug-n-play competitors from the top LB scorers: how much sweat and blood they put into fine-tuning their solutions.
Believe it or not, it was “pure luck”: while performing the validation step (displaying some sample images per cell of the confusion matrix), the camera-shy image popped up. At first I thought it was a bug in my code, then realised it was the actual data! Glad to know I’m not the only one, then! (As @haresenpai suggested, it might well be a Kaggle “feature” / anti-cheat measure.) Thanks for checking your data!
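For anyone curious, the per-cell inspection described above can be sketched roughly like this (the helper name and the cat/dog encoding are my own, hypothetical choices; in practice you’d then plot the images at the returned indices):

```python
from collections import defaultdict

def samples_per_cell(y_true, y_pred, k=3):
    """Group sample indices by confusion-matrix cell (true, predicted),
    keeping at most k indices per cell for visual inspection."""
    cells = defaultdict(list)
    for i, (t, p) in enumerate(zip(y_true, y_pred)):
        if len(cells[(t, p)]) < k:
            cells[(t, p)].append(i)
    return dict(cells)

# 0 = cat, 1 = dog. Indices in cell (1, 0) are "dogs" the model called
# cats — prime suspects for bad labels like dog.5604.jpg.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0]
print(samples_per_cell(y_true, y_pred))
```

The off-diagonal cells are where label errors tend to hide, which is why displaying a few images from each cell surfaces them.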
Interesting observation. It is hard to say whether it is on purpose or just an error. Maybe they wanted to promote this useful algorithm: https://arxiv.org/abs/1412.6596
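For context, that paper proposes a “bootstrapping” loss for training on noisy labels: the target blends the given (possibly wrong) label with the model’s own prediction. A minimal numpy sketch of my reading of the soft variant, where beta is the blending weight (function name and example values are mine):

```python
import numpy as np

def soft_bootstrap_loss(q, t, beta=0.95):
    """Soft bootstrapping loss (per sample): cross-entropy against a
    target that mixes the noisy one-hot label t with the model's own
    predicted distribution q, both of shape (n_samples, n_classes)."""
    q = np.clip(q, 1e-12, 1.0)            # avoid log(0)
    target = beta * t + (1.0 - beta) * q  # blended target
    return -np.sum(target * np.log(q), axis=1)

# A sample labeled "dog" that the model is sure is a cat gets a smaller
# penalty than under plain cross-entropy, softening the bad label's pull.
q = np.array([[0.99, 0.01]])   # model: almost certainly class 0
t = np.array([[0.0, 1.0]])     # noisy label: class 1
print(soft_bootstrap_loss(q, t))         # bootstrapped loss
print(-np.sum(t * np.log(q), axis=1))    # plain cross-entropy (larger)
```

That property is exactly what makes this family of losses attractive when a dataset is known to contain mislabeled samples like this one.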