Dogs vs. Cats Redux: removing incorrectly labeled images?

In my manual evaluation of the Dogs vs. Cats data, and in particular of the images my model is getting wrong, I ran across a number of images that were either incorrectly labelled (generally blocks of text) or arbitrarily labelled (images containing both a dog and a cat; cat.1450.jpg is an example).

I subscribe to the adage ‘garbage in -> garbage out’, but I’m not that familiar with Kaggle, so I thought I’d ask here and see what people think.

Is it okay to clean the training and validation data? And if so, can it be done manually, or does that violate Kaggle’s rule against manual labelling? I’m assuming that rule refers to the test set, but I could see an argument for wanting to eliminate all manual processes.

I’ve been thinking about simple ways to automatically identify at least the arbitrarily labelled images so they can be removed, since my guess is they cause the most harm during training. But without relabelling them or coming up with other example images, I’m not sure how to automate it.
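One crude heuristic I’ve been considering (a rough, untested sketch; the function name and thresholds are arbitrary) is to run my current model over the training set and flag images where the predicted probability either strongly contradicts the label or sits near 0.5, which is what you’d expect for a cat-and-dog image:

```python
import numpy as np

# Hypothetical helper: flag training images whose predicted P(cat) disagrees
# strongly with the assigned label, or hovers near 0.5.
# Assumes you already have per-image probabilities, ideally out-of-fold predictions.
def flag_suspect_images(filenames, labels, probs, low=0.05, high=0.95):
    labels = np.asarray(labels)   # 1 = cat, 0 = dog
    probs = np.asarray(probs)     # model-predicted P(cat)

    # labelled cat but the model is very sure it's a dog, or vice versa
    confident_wrong = ((labels == 1) & (probs < low)) | ((labels == 0) & (probs > high))
    # the model can't decide, possibly an image containing both animals
    ambiguous = (probs > 0.35) & (probs < 0.65)

    suspect = confident_wrong | ambiguous
    return [f for f, s in zip(filenames, suspect) if s]
```

Anything flagged that way could then be reviewed by eye before deciding whether to drop or relabel it.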

My examination of the test data didn’t reveal any incorrectly labelled images, which makes a lot of sense: if the test set contained incorrectly or arbitrarily labelled images, the winners of a close competition would essentially be chosen at random.

So what do you think? Is it okay to improve the training and validation sets? And if so, can you do it manually? Obviously you do so at your own risk of losing information, but I’m inclined to clean the training set and see where that gets me.

It depends on the competition rules: most competitions allow manual relabelling of the training data, and some require that you share this information on the forums. As an example: https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring/discussion/28150


Andrew Ng talks about this issue in his Deep Learning course. Here are a few statements that I remember:

  • Mislabelled data in the training set is usually not a big deal; deep learning models are surprisingly robust to a moderate amount of label noise.
  • The validation set and the test set should come from the same distribution, so if you fix labels in one you should fix them in the other. Since you don’t have control over the test set, it might not be the best idea to fix your training and validation sets.

If you want to give it a try anyway, I would first try implementing this paper: https://arxiv.org/pdf/1705.03419.pdf

It addresses noisy label sets, although it seems to be aimed at large datasets.
If you decide to give it a try, TensorFlow has a new version of softmax cross-entropy that simplifies such implementations:
tf.nn.softmax_cross_entropy_with_logits_v2
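For example, here is a minimal TensorFlow 1.x sketch of using it with soft labels (the placeholders and the idea of 50/50 labels for cat-and-dog images are just assumptions for illustration). The _v2 op, unlike the original, also lets gradients flow into the labels, which matters if the label distributions are themselves variables being refined during training:

```python
import tensorflow as tf  # TensorFlow 1.x API

# logits from your network, shape (batch, 2) for the cat/dog classes
logits = tf.placeholder(tf.float32, shape=(None, 2))
# soft labels, e.g. (0.5, 0.5) for an image containing both a cat and a dog
soft_labels = tf.placeholder(tf.float32, shape=(None, 2))

# unlike the original op, _v2 also backpropagates into the labels,
# so label distributions can be trainable variables if you want them to be
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=soft_labels, logits=logits))
```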