Cleaning large number of images with large number of classes

mrfabulous1 · February 21, 2020, 7:08am

Hi EnHakore, hope you're having a wonderful day.

This forum is a wonderful place where we try not to describe other people's work as useless.

Before some of the members created the confusion matrix and image cleaner widget, life was even more difficult.
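To give a rough idea of why a confusion matrix helps when you have many classes: it tells you which pairs of classes get mixed up most often, so you can start checking those images first instead of going through everything. Below is a minimal plain-Python sketch of that idea. The function name `most_confused_pairs` and the sample labels are made up for illustration, and this is not the fastai implementation.

```python
from collections import Counter

def most_confused_pairs(actual, predicted, min_count=1):
    """Return (actual, predicted, count) tuples for misclassified pairs,
    most frequent first. Hypothetical helper, for illustration only."""
    # Count only the cases where the prediction disagrees with the label
    counts = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    pairs = [(a, p, n) for (a, p), n in counts.items() if n >= min_count]
    # Highest count first; ties are broken alphabetically for a stable order
    return sorted(pairs, key=lambda t: (-t[2], t[0], t[1]))

# Made-up labels, purely for illustration
actual    = ["cat", "cat", "dog", "dog", "fox", "fox", "fox"]
predicted = ["cat", "dog", "dog", "cat", "dog", "dog", "fox"]

pairs = most_confused_pairs(actual, predicted)
for a, p, n in pairs:
    print(f"labelled {a}, predicted {p}: {n}")
```

With a large number of classes, raising `min_count` keeps the list short, so you only review the class pairs that are confused repeatedly.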

The problem you describe is faced by every single data scientist and ML practitioner. It is for this reason that many people use other people's datasets to train their models, such as the examples here: https://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/

Many of these datasets have been hand curated (checked by humans), using services such as https://www.mturk.com/ where people pay others to check and label their data. Some people and organizations have actually spent years cleaning and checking their datasets.

Cleaning data is a problem even for companies such as Google.

A search on the internet for something like ‘data cleansing’ may help.

Also if you read the posts in the thread below you will see some ideas you may be able to adapt to clean your data.

Handle data that belongs to classes not seen in training or testing

Here is a link to a fastai2 notebook on predicting unknown labels. I have actually tested this in fastai2.
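One common way to get an "unknown" prediction (this may or may not be exactly what the notebook does) is to score each class independently with a sigmoid and report "unknown" when no class reaches a threshold. Here is a minimal plain-Python sketch of that idea. The function name `predict_or_unknown`, the threshold of 0.5 and the sample scores are all assumptions for illustration, not fastai API.

```python
import math

def sigmoid(x):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_or_unknown(logits, classes, threshold=0.5):
    """Return every class whose sigmoid score reaches the threshold,
    or ["unknown"] if none does. Hypothetical helper, for illustration only."""
    probs = [sigmoid(z) for z in logits]
    labels = [c for c, p in zip(classes, probs) if p >= threshold]
    return labels or ["unknown"]

classes = ["cat", "dog", "fox"]

# Made-up raw model outputs: one confident image, one that matches no class well
confident = predict_or_unknown([3.0, -2.0, -4.0], classes)
uncertain = predict_or_unknown([-1.5, -2.0, -3.0], classes)
print(confident, uncertain)
```

Images that come back as "unknown" are good candidates for a manual check, since they are often mislabelled or do not belong to any of your classes. The threshold needs tuning on your own data.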

If you find some good ideas maybe you could write a blog or a widget that could help the rest of us in the community handle large messy datasets.

Cheers, mrfabulous1