Automatically Detect Annotation Errors in Image/Text Tagging Datasets

Hey guys! Many of us in ML work with multi-label data, where each image or text example is tagged with multiple labels. These datasets often contain incorrect and/or missing tags (see what we found in the CelebA dataset below), which make it hard to train highly accurate ML models.

I’m excited to share our newest research on algorithms to automatically find label errors in multi-label classification datasets. Image and document tagging are important instances of multi-label classification, where each example can belong to any number (including none) of K possible classes. Because annotating such data requires many decisions per example, multi-label datasets often contain tons of label errors, which harm the performance of ML models.
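To make the setup concrete, here is a toy sketch (my own illustrative example, not from the cleanlab docs) of how multi-label tags are typically represented: a list of class indices per example, which you can also expand into a multi-hot matrix if your tooling expects one.

```python
# Toy multi-label tags for 4 examples with K = 3 classes (indices 0, 1, 2).
# An example may carry several tags or none at all.
labels = [
    [0, 2],     # example tagged with classes 0 and 2
    [1],        # single tag
    [],         # no tags
    [0, 1, 2],  # all three tags
]

# Equivalent multi-hot encoding: one row per example, one column per class.
multi_hot = [[1 if k in tags else 0 for k in range(3)] for tags in labels]
```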

We’ve open-sourced our algorithms in the recent release of cleanlab v2.2. To use them, all you need is one line of open-source code: cleanlab.filter.find_label_issues.

from cleanlab.filter import find_label_issues

ranked_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    multi_label=True,
    return_indices_ranked_by="self_confidence",
)
# labels: list of lists of (multiple) labels of each example
# pred_probs: predicted class probabilities from any trained classifier
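The pred_probs should be out-of-sample predictions (each example scored by a model that never trained on it). One common way to get them, sketched below with assumed toy data and a one-vs-rest logistic regression (any classifier that outputs per-class probabilities works), is K-fold cross-validation:

```python
# Sketch: obtain out-of-sample pred_probs via 3-fold cross-validation.
# The features X, multi-hot labels Y, and model choice here are all
# placeholder assumptions; substitute your own data and classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # toy feature matrix
Y = (rng.random((100, 3)) < 0.3).astype(int)   # toy multi-hot tags, K = 3

pred_probs = np.zeros(Y.shape, dtype=float)
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    model = OneVsRestClassifier(LogisticRegression())
    model.fit(X[train_idx], Y[train_idx])
    # Each example's probabilities come from a model that never saw it.
    pred_probs[test_idx] = model.predict_proba(X[test_idx])
```

The resulting (num_examples, K) array of per-class probabilities can be passed directly as pred_probs above.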

Running the new find_label_issues() function on the CelebA image tagging dataset reveals around 30,000 mislabeled images! Check out a few of them in the blog post!

Resources:

If you have any questions, please let me know :) Hope you find these practical tools useful in your real-world ML applications!
