Removing noisy examples


(Ben Johnson) #1

Hi All –

I have a binary image classification task w/ a fairly large dataset (hundreds of thousands of images per class). However, there are a lot of “uninformative examples” in the training set – images that do not contain enough information to make a class prediction. An example of this situation would be trying to build a classifier to determine whether a picture is of a cat or an airplane, but > 50% of the training images don’t contain either a cat or an airplane.

I’m trying to figure out how to proceed. A reasonable method might be to:

a) learn a classifier to determine whether a training example is “informative”
b) filter the training data to only “informative” examples, then train a standard classifier

but I can’t figure out exactly how to implement that (e.g., because I don’t have any labels indicating whether an example is informative or not).
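One assumption-laden way to get around the missing “informativeness” labels is to use a model trained on the noisy labels itself as the scorer for step (a): treat examples whose predicted probability sits near 0.5 as uninformative and drop them before retraining. Here’s a toy sketch of that two-step pipeline – the 1-D feature space, the cluster locations, and the threshold `tau` are all made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Toy stand-in for image features: cats near -2, airplanes near +2,
# and uninformative images near 0 carrying random 0/1 labels.
n = 1000
X = np.vstack([
    rng.normal(-2.0, 0.5, (n, 1)),      # cats, label 0
    rng.normal(+2.0, 0.5, (n, 1)),      # airplanes, label 1
    rng.normal(0.0, 0.5, (4 * n, 1)),   # uninformative, random labels
])
y = np.concatenate([
    np.zeros(n),
    np.ones(n),
    rng.integers(0, 2, 4 * n),
])

# Step (a): fit on everything, then score "informativeness" by how far
# the predicted probability is from 0.5.
scorer = LogisticRegression().fit(X, y)
p = scorer.predict_proba(X)[:, 1]
tau = 0.2  # hypothetical threshold, would need tuning in practice
informative = np.abs(p - 0.5) > tau

# Step (b): retrain a standard classifier on the filtered subset.
clf = LogisticRegression().fit(X[informative], y[informative])
print(informative.mean())  # fraction of training data kept
```

The filter keeps nearly all of the cat/airplane clusters and discards most of the middle cluster, so the retrained model sees a much cleaner label distribution. Whether this works on real images depends on the (untested) assumption that the noisy first-stage model actually pushes uninformative examples toward p=0.5.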

I know that CNNs can be robust to massive label noise, so it’s possible that if I trained a classifier on this noisy data, all of the “uninformative” points would get forced to p=0.5 and everything would be OK.

I actually tried this on CIFAR-10 – train a classifier to distinguish cats (0) from airplanes (1), but w/ the other 8 classes randomly assigned 0/1 labels. The distribution of predicted probabilities ends up being tri-modal – a low mode for cats, a high mode for airplanes, and a middle mode right at 0.5 for the other 8 classes. So that suggests that throwing all the data into a model and praying for the best could work. I’m training that model now, but wanted to see if anyone here had any ideas in the meantime.
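For anyone who wants to poke at the tri-modal effect without downloading CIFAR-10, here’s a tiny synthetic reproduction. It stands in for the real experiment with a made-up 1-D feature (cats near -2, airplanes near +2, the 8 other classes near 0) and a logistic regression instead of a CNN, but the label setup is the same: clean 0/1 labels for the two real classes, random labels for everything else:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

n = 1000
cats = rng.normal(-2.0, 0.5, (n, 1))      # true label 0
planes = rng.normal(+2.0, 0.5, (n, 1))    # true label 1
noise = rng.normal(0.0, 0.5, (8 * n, 1))  # 8 other classes

X = np.vstack([cats, planes, noise])
y = np.concatenate([
    np.zeros(n),                     # cats -> 0
    np.ones(n),                      # airplanes -> 1
    rng.integers(0, 2, 8 * n),       # other classes -> random 0/1
])

p = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# Three modes: cats low, the randomly-labeled classes near 0.5,
# airplanes high.
print(p[:n].mean(), p[2 * n:].mean(), p[n:2 * n].mean())
```

Even with 80% of the training set carrying random labels, the clean extremes pull the decision boundary into place and the randomly-labeled cluster settles around p=0.5 – consistent with the tri-modal histogram described above, though obviously a linear model on 1-D features is a much easier setting than a CNN on images.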