Lesson 2 - verify_images - Data quality

imrandude · March 26, 2020, 9:26pm

Hi,

In lesson 2, after downloading the images from the Bing API, the images are validated using the function verify_images.
What it really does is try to open each file at a given location as an image and, if that succeeds, add it to a list. This list is returned as the output of the function.
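
For illustration, the check described above could be sketched roughly like this with Pillow. This is not fastai's implementation; `keep_openable_images` is a hypothetical name, and calling `load()` to force a full decode is my own assumption about what "opening" should mean here.

```python
from pathlib import Path
from PIL import Image

def keep_openable_images(folder):
    """Return the paths under `folder` that Pillow can open and decode.

    Hypothetical helper for illustration only, not fastai's verify_images.
    """
    good = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        try:
            with Image.open(path) as img:
                img.load()  # force a full decode, not just a header read
            good.append(path)
        except Exception:
            # Unreadable or corrupt file: leave it out of the list.
            pass
    return good
```

Calling `keep_openable_images("data/bears")` (the folder name is made up) would return only the files that decode without an error.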
Now, this is more or less a quality check of the input image, to be fed into a model.
Apart from the usual suspects, like

Channel count of the image file
Dimensions of the file
Channel type (RGB vs others)

Is there any sort of list of checks to be done with respect to input image quality?
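
To make the three properties listed above concrete, here is a small sketch that reads them with Pillow. The function name `image_report` and the `min_side` threshold are assumptions for illustration, not part of any library or of the lesson.

```python
from PIL import Image

def image_report(path, min_side=32):
    """Collect channel count, dimensions, and channel type for one image.

    `min_side` is an arbitrary, hypothetical threshold for "too small".
    """
    with Image.open(path) as img:
        width, height = img.size
        return {
            "mode": img.mode,                 # e.g. "RGB", "L", "RGBA", "CMYK"
            "channels": len(img.getbands()),  # 3 for RGB, 1 for grayscale
            "width": width,
            "height": height,
            "is_rgb": img.mode == "RGB",
            "too_small": min(width, height) < min_side,
        }
```

Running this over a folder and counting the distinct `mode` values is a quick way to spot grayscale or CMYK files mixed in with RGB ones.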

I did see this post from Google AI on assigning a vector score in line with image quality and aesthetics, but couldn't find much else on the topic.

Has any other member tried anything on this front? Any input would be very helpful.

radek · March 27, 2020, 9:48am

DL models are very good at working with all sorts of data. They also deal very well with noisy labels (where some images are incorrectly tagged), as long as the errors are random rather than systematic.

I wouldn’t worry too much about input quality in the technical sense. The most important consideration is that your training set is representative of the data the model will see in the wild, and that you also have a good validation set. That is probably where you would get the best ROI on attention / time invested.

imrandude · April 13, 2020, 3:45pm

Thanks @radek. I am writing up a short standard operating procedure for deep learning image problems.
I was also looking at a related paper, Understanding How Image Quality Affects Deep Neural Networks, in which the researchers explored how differences in image quality led to wrong predictions by the network.

What I understand is that there is a separate topic called Image Quality Assessment, which deals with predicting the aesthetic appeal of an image.

Further, there are apparently also traditional ML algorithms used for image cleaning. For example, ImageSetCleaner uses k-means, GMM, and agglomerative clustering to detect wrong labels.
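
As a rough sketch of the clustering idea (not ImageSetCleaner's actual code, which I have not reproduced here): take feature vectors for all images under one label, split them into two clusters with a tiny k-means, and treat the smaller cluster as candidates for wrong labels. The function name, the fixed k=2, and the assumption that the features come from something like a pretrained network are all mine.

```python
import numpy as np

def flag_suspect_images(features, n_iter=20):
    """Run 2-cluster k-means on one label's feature vectors and return
    the indices in the smaller cluster as possible wrong labels.

    Hypothetical helper; `features` is an (n_images, n_dims) array-like.
    """
    x = np.asarray(features, dtype=float)
    # Deterministic init: the first point and the point farthest from it.
    far = np.argmax(np.linalg.norm(x - x[0], axis=1))
    centroids = np.stack([x[0], x[far]])
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        new = np.stack([
            x[assign == k].mean(axis=0) if np.any(assign == k) else centroids[k]
            for k in range(2)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    # The cluster with fewer members is the suspect one.
    minority = np.argmin(np.bincount(assign, minlength=2))
    return np.flatnonzero(assign == minority).tolist()
```

The flagged indices are only candidates for a manual look; a small cluster can just as well be a legitimate but rare sub-type of the class.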

There are also other image-similarity options like t-SNE and UMAP for visualizing the dataset, which can help with cleaning the images.