Automatic detection of mislabeled (or at least noisy) images in the training dataset

Hi all,

I’m used to working with tabular data. With that kind of data, it is common to analyze row-wise outliers semi-automatically through PCA, univariate and multivariate outlier detection, and so on. In image recognition, removing row-wise outliers seems much more difficult, as the information in the images is not aligned row by row. As a result, a lot of time can be spent visually inspecting the training dataset and removing images from it.

I’ve just thought of a method to automatically detect potentially mislabeled images in a training dataset by making cross-validated predictions for those images. When a subset of the training dataset is used to predict the labels of the remaining subset, each prediction should agree with the label already assigned to that image. If it does not, this might indicate a mislabeled image.

For example, imagine a training dataset of 5000 images labeled as cats and 5000 labeled as dogs, in which 10 of the "cats" are actually dogs. I train on one half, 2500 cats (including 5 mislabeled dogs) and 2500 dogs, and predict the other half, which also holds 2500 cats (including 5 mislabeled dogs) and 2500 dogs. If the model is accurate, the predictions should come out as 2495 cats and 2505 dogs, because the model assigns the 5 mislabeled dogs to their true class.
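
The counts in this example can be checked with a few lines of Python. This is only a sketch under an idealised assumption: the model predicts the true class of every held-out image. The lists `given` and `true` are made up to match the numbers in the example.

```python
from collections import Counter

# Held-out half from the example: 2500 images labeled "cat"
# (5 of them are really dogs) and 2500 images labeled "dog".
given = ["cat"] * 2500 + ["dog"] * 2500
true = ["cat"] * 2495 + ["dog"] * 5 + ["dog"] * 2500

# Idealised assumption: the model predicts the true class every time.
predicted = list(true)

# Count predictions per class and collect the images whose
# prediction disagrees with the label stored in the dataset.
counts = Counter(predicted)
suspects = [i for i, (g, p) in enumerate(zip(given, predicted)) if g != p]

print(counts["cat"], counts["dog"])  # 2495 2505
print(len(suspects))                 # 5
```

A real model will also misclassify some correctly labeled images, so the list of disagreements is a list of candidates to review, not a list of confirmed errors.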

Repeating this kind of cross-validation k times gives several predictions per image, so we can check that an unexpected prediction is consistent rather than a one-off. Computing time can be cut considerably by using lax hyperparameters, since maximizing the metric is not the priority here.
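
As a rough illustration of the repeated cross-validation idea (this is not the code from the Colab), here is a minimal sketch in plain Python. Several things are assumptions made only for the example: 2-D synthetic points stand in for images, a nearest-centroid classifier stands in for a quickly trained image model, and the helper names, the 5 folds, the 10 repeats and the 0.8 threshold are all hypothetical choices.

```python
import random

def centroid(points):
    """Mean of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in for a cheap model: predict the class with the closest centroid."""
    centroids = {
        label: centroid([x for x, y in zip(train_x, train_y) if y == label])
        for label in set(train_y)
    }
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(centroids, key=lambda c: sqdist(x, centroids[c])) for x in test_x]

def disagreement_rate(features, labels, k=5, repeats=10, seed=0):
    """Fraction of repeats in which the held-out prediction contradicts the label."""
    rng = random.Random(seed)
    n = len(features)
    disagreements = [0] * n
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]  # each sample is held out once
        for held_out in folds:
            held = set(held_out)
            train = [i for i in order if i not in held]
            preds = nearest_centroid_predict(
                [features[i] for i in train],
                [labels[i] for i in train],
                [features[i] for i in held_out],
            )
            for i, p in zip(held_out, preds):
                if p != labels[i]:
                    disagreements[i] += 1
    return [d / repeats for d in disagreements]

# Synthetic data: two well-separated clusters, with 5 "dogs" labeled as cats.
rng = random.Random(42)
cats = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
dogs = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(200)]
features = cats + dogs
labels = ["cat"] * 200 + ["dog"] * 200
mislabeled = [200, 201, 202, 203, 204]
for i in mislabeled:
    labels[i] = "cat"

rates = disagreement_rate(features, labels)
suspects = [i for i, r in enumerate(rates) if r >= 0.8]
print(suspects)
```

Counting disagreements across repeats, instead of trusting a single split, is what separates a consistently contradicted label from an occasional misprediction. On real images the classes overlap far more than in this toy data, so the threshold would need tuning and the flagged images would still need a visual check.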

I’ve put together a Google Colab notebook to show how mislabeled images can be detected:

  • when shuffling dogs and cats in the dogscats dataset
  • in a dataset of images of men and women collected by automatic download from Google Images.

I’d like to know whether you find the idea interesting, or whether better implementations of the concept are already in use. I haven’t found any examples so far.