Hey guys! Many of us in ML work with multi-label data, where each image or text is tagged with multiple labels. These datasets often contain label errors and missing tags (see what we found in the CelebA dataset below), which make it hard to train highly accurate ML models.
I’m excited to share our newest research on algorithms that automatically find label errors in multi-label classification datasets. Image and document tagging are important examples of multi-label classification, where each example can belong to one, several, or none of K possible classes. Because annotating such data requires many decisions per example, multi-label datasets often contain tons of label errors, which harm the performance of ML models.
We’ve open-sourced our algorithms in the recent release of cleanlab v2.2. Using them takes a single function call:
    from cleanlab.filter import find_label_issues

    ranked_label_issues = find_label_issues(
        labels=labels,
        pred_probs=pred_probs,
        multi_label=True,
        return_indices_ranked_by="self_confidence",
    )
    # labels: list of lists of (multiple) labels of each example
    # pred_probs: predicted class probabilities from any trained classifier
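To make the two inputs more concrete, here is a small pure-Python sketch of what they might look like, together with a toy ranking. It assumes each label is an integer class index in 0..K-1 and that each row of pred_probs holds one independent probability per class (so a row need not sum to 1). The helper names and the "take the minimum over classes" score are my own simplification for illustration, not the scoring cleanlab actually uses.

```python
def to_multi_hot(labels, num_classes):
    """Convert list-of-lists labels into an N x K matrix of 0/1 flags."""
    matrix = []
    for tags in labels:
        row = [0] * num_classes
        for k in tags:
            row[k] = 1
        matrix.append(row)
    return matrix


def simple_self_confidence_scores(labels, pred_probs):
    """Toy per-example score (hypothetical helper, not part of cleanlab).

    For each class, measure how much the model agrees with the annotation:
    p if the class is tagged, 1 - p if it is not. The example's score is the
    weakest agreement across classes, so lower means more suspicious.
    """
    num_classes = len(pred_probs[0])
    multi_hot = to_multi_hot(labels, num_classes)
    scores = []
    for flags, probs in zip(multi_hot, pred_probs):
        per_class = [p if y == 1 else 1.0 - p for y, p in zip(flags, probs)]
        scores.append(min(per_class))
    return scores


# Made-up data: 4 examples, K = 3 classes.
labels = [[0, 2], [1], [], [0, 1]]  # example 2 has no tags at all
pred_probs = [
    [0.90, 0.10, 0.80],
    [0.20, 0.10, 0.70],  # model doubts tag 1 and leans towards class 2
    [0.10, 0.05, 0.10],
    [0.80, 0.90, 0.30],
]

scores = simple_self_confidence_scores(labels, pred_probs)
# Indices sorted from most to least suspicious.
ranked = sorted(range(len(scores)), key=lambda i: scores[i])
print(ranked)
```

In this toy data, example 1 comes out first because the model gives its only tag a probability of 0.1. With the real library you would pass pred_probs as a NumPy array, ideally computed out-of-sample (e.g. via cross-validation).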
Running the new find_label_issues() function on the CelebA image tagging dataset reveals around 30,000 mislabeled images! Check out a few of them in the blog post!
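Once you have the ranked indices, you will probably want to eyeball the flagged examples. Here is a rough sketch of a review helper. describe_flagged, the 0.5 threshold, the class names, and the data are all made up for illustration (they are not part of cleanlab and not real CelebA results), and ranked_label_issues is just a stand-in for the indices the function returns.

```python
def describe_flagged(index, labels, pred_probs, class_names, threshold=0.5):
    """Summarize how the given tags differ from the model's thresholded guess.

    Hypothetical helper: the 0.5 threshold is an arbitrary choice.
    """
    given = set(labels[index])
    predicted = {k for k, p in enumerate(pred_probs[index]) if p >= threshold}
    return {
        "index": index,
        "given": sorted(class_names[k] for k in given),
        # Classes the model predicts but the annotator did not tag.
        "possibly_missing": sorted(class_names[k] for k in predicted - given),
        # Classes the annotator tagged but the model does not predict.
        "possibly_wrong": sorted(class_names[k] for k in given - predicted),
    }


# Placeholder class names and data, not actual dataset contents.
class_names = ["smiling", "eyeglasses", "wearing_hat"]
labels = [[0], [1, 2], []]
pred_probs = [
    [0.95, 0.05, 0.10],
    [0.10, 0.20, 0.90],
    [0.05, 0.10, 0.02],
]
ranked_label_issues = [1]  # stand-in for the output of find_label_issues

for i in ranked_label_issues:
    print(describe_flagged(i, labels, pred_probs, class_names))
```

A summary like this only tells you where the annotation and the model disagree. Whether the label is actually wrong still needs a human look at the image or text.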
- Blog post: Automatic Error Detection for Image/Text Tagging and Multi-label Datasets
- Paper: [2211.13895] Identifying Incorrect Annotations in Multi-Label Classification Data
- Tutorial: Find Label Errors in Multi-Label Classification Datasets - cleanlab
- Benchmarks: cleanlab/multilabel-error-detection-benchmarks on GitHub (benchmarking label error detection algorithms for multi-label classification)
- Code: cleanlab/cleanlab on GitHub (the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels)
If you have any questions, please let me know :) Hope you find these practical tools useful in your real-world ML applications!