In my manual evaluation of the Dogs vs. Cats data, and in particular of the images my model is getting wrong, I ran across a number of images that were either incorrectly labelled (generally blocks of text) or arbitrarily labelled (images containing both a dog and a cat; cat.1450.jpg is an example).
I subscribe to the adage ‘garbage in -> garbage out’, but I’m not that familiar with Kaggle, so I thought I’d ask here and see what people think.
Is it okay to clean the training and validation data? And if so, can it be done manually, or does that violate Kaggle’s rule against manual labeling? I’m assuming that rule refers to the test set, but I could see an argument for wanting to eliminate all manual processes.
I’ve been thinking about simple ways to automatically identify at least the arbitrarily labelled images so they can be removed, since my guess is they’ll cause the most harm during training. But without relabeling them or coming up with other example images, I’m not sure I can automate it.
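One semi-automatic approach I’ve considered (a sketch of my own, not something the competition endorses): train a model with cross-validation on the labels as given, then flag any training image whose stated label the model finds extremely unlikely, and only manually review that short list. The function name and threshold below are my own placeholders:

```python
def flag_suspect_labels(probs, labels, threshold=0.05):
    """Return indices of examples whose stated label the model finds very unlikely.

    probs:  out-of-fold P(image is a dog) from a model trained on the rest of the data
    labels: labels as given in the training set, 1 = dog, 0 = cat
    """
    suspects = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        # Probability the model assigns to the label we were given
        p_label = p if y == 1 else 1.0 - p
        if p_label < threshold:
            suspects.append(i)
    return suspects

# Example: the third image is labelled cat (0), but the model is 97% sure it's a dog
probs = [0.9, 0.1, 0.97, 0.5]
labels = [1, 0, 0, 1]
print(flag_suspect_labels(probs, labels))  # → [2]
```

This wouldn’t catch the dog-and-cat images directly (the model may genuinely be uncertain about those), but it cuts the manual review down to a handful of candidates instead of the whole training set.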
Examining the test data didn’t reveal any incorrectly labelled images to me, which makes a lot of sense: if the test set contained incorrectly or arbitrarily labelled images, the winners of a close competition would essentially be chosen at random.
So what do you guys think? Is it okay to improve the training and validation set? And if so, can you do it manually? Obviously you do so at your own risk of losing information, but I’m inclined to clean the training set and see where that gets me.