Should I clean up data that obviously does not belong?

Doing the Cats Vs. Dogs lesson right now and looking at the images which the model had the hardest time classifying. Some of those examples are neither dog nor cat, and obviously impossible to classify.
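For reference, I am pulling these up roughly like this (assuming fastai's interpretation helpers and a trained learner called `learn`; the exact import path may differ between fastai versions):

```python
# Sketch: show the images with the highest loss so they can be inspected by eye.
# Assumes a trained fastai Learner named `learn`; API details vary by fastai version.
from fastai.vision.all import *

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9, figsize=(10, 10))  # the 9 images the model gets most wrong
```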

What is best practice in situations like this? Let’s say this was a Kaggle competition. What should one do?
Should samples like that be left in the data or cleaned out?

I am thinking that cleaning out samples like that would be better, but I am not sure.
My thinking goes like this:

  • These bad samples do not contribute to training my model (because their class is undefined)
  • Even if bad samples like this are presented to the trained model at prediction time, the outcome is undefined anyway, so it will not matter whether my model has seen such examples during training.

Am I doing some logical mistake here?

I think there are no quick and fast rules for approaching such situations. Your reasoning seems fine to me.

Andrew Ng makes an interesting point in one of his lectures - considerations like this should probably be driven by error analysis. For mislabeled data, you could look at the errors your model produces. If you are getting 100 errors and 80 of them are due to mislabeled data, then that is probably a problem. If your model produces 100 errors but only 5 of them are due to mislabeled data, then your time would probably be better spent elsewhere to improve its performance than on cleaning up the data sets.
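To make that concrete, error analysis here is basically just hand-labelling a sample of your model's mistakes and tallying the causes. A minimal sketch of the bookkeeping (the categories and counts below are made up for illustration):

```python
from collections import Counter

# Hypothetical tally after eyeballing ~100 misclassified validation images
# and noting a cause for each one.
causes = (
    ["mislabeled"] * 5        # label is wrong / image is neither cat nor dog
    + ["poor_quality"] * 20   # blurry, occluded, tiny subject, etc.
    + ["model_mistake"] * 75  # the model simply got a reasonable image wrong
)

counts = Counter(causes)
total = len(causes)
for cause, n in counts.most_common():
    print(f"{cause}: {n}/{total} ({100 * n / total:.0f}%)")

# If "mislabeled" is only a small slice (5% here), cleaning the labels is
# unlikely to move the needle much; spend the effort elsewhere.
```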

I know it might not be very satisfying, but another answer here is probably to just give it a shot if cleaning the data doesn't require a whole lot of time. IIUC, neural networks should be quite robust to noisy labels / incorrect data (as long as the noise is random and not systematic), but at this point I am becoming more convinced that running experiments on nearly anything is the way to go.


In one of the lessons, Jeremy mentioned a paper that studied the effect of noise in the training data (due to mislabelled examples, wrong images, etc). It turns out that deep learning is quite resilient to this kind of error, as long as it is random.

So if the amount (and kinds) of errors is roughly the same across all classes, then removing these bad examples or labels isn’t going to make much of a difference to the end result.

However, if the errors only happen in the cat examples and not in the dog examples, for example, then they might make performance worse, because the error is no longer random but systematic.
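If you want to see this for yourself, one cheap experiment is to corrupt the labels deliberately in both ways and compare validation accuracy after retraining on each version. A rough sketch of the two noise patterns (toy NumPy labels; the training loop is whatever you already use):

```python
import numpy as np

rng = np.random.default_rng(42)
labels = np.array([0, 1] * 500)  # toy labels: 0 = cat, 1 = dog

def random_noise(y, frac=0.1):
    """Flip `frac` of the labels chosen uniformly at random (spread over both classes)."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    y[idx] = 1 - y[idx]
    return y

def systematic_noise(y, frac=0.1):
    """Flip `frac` of the labels, but only among the cat (0) examples."""
    y = y.copy()
    cat_idx = np.where(y == 0)[0]
    idx = rng.choice(cat_idx, size=int(frac * len(y)), replace=False)
    y[idx] = 1
    return y

# Train the same model once on random_noise(labels) and once on
# systematic_noise(labels), then evaluate on clean labels; the systematic
# version should hurt noticeably more.
```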


Thanks both of you. Makes sense.
I will try it out on my cats and dogs and see what happens if I clean some of the problematic images out.
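In case it helps anyone else: recent fastai versions ship a small widget for exactly this kind of cleanup (sketch below, assuming a trained learner called `learn` and a dataset root `path`; older fastai versions have a similar `ImageCleaner` in `fastai.widgets`):

```python
# Sketch of fastai v2's cleaning widget; assumes a trained Learner `learn`
# and that `path` points at the root of the image dataset.
import shutil
from fastai.vision.widgets import ImageClassifierCleaner

cleaner = ImageClassifierCleaner(learn)
cleaner  # in a notebook, renders top-loss images per class with Delete / relabel dropdowns

# Run after making selections in the widget:
for idx in cleaner.delete():
    cleaner.fns[idx].unlink()                       # delete files marked "Delete"
for idx, cat in cleaner.change():
    shutil.move(str(cleaner.fns[idx]), path / cat)  # move relabeled files to the new class folder
```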

I think the above guidelines should be followed, but I would still like to point out that in the Rossmann Store Sales prediction competition, Jeremy talks about some outlier data points. The 3rd-place winners didn't notice them, but the 1st-place winners realized that sales are unusually high just before and after a store closes. Since there is no data on store closures, that effect can't be modeled.

Summing up, I think removal is not going to harm you. If it is not too time-consuming, you should do it.

Here is a post about best practices. The above point has been covered there with a link to the video where this topic was discussed.

Thanks for the info!
I will definitely try to clean out some and read up on the link you sent me.

How do you judge whether the errors are due to mislabeled data or to the model itself?
Thanks,