# They are the same pictures

Is there a mathy way to tell whether two image datasets come from essentially the same distribution? Or do we just have to look at them and subjectively decide that they are the same pictures?
e.g.
dataset-1: pictures of bears
dataset-2: sketches of bears
I am using an extreme example. One way is to just look at them and see that they are different, but can we use some math here instead, like KNN (n=2)?

A simple way is to use a pretrained ImageNet model to create a feature vector for every image (you need to remove the classifier layer from the model for this). Then you can compute distances between the feature vectors to see how similar the images are (L2 distance, for example).
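
To make the distance idea concrete, here is a minimal sketch. It assumes the feature vectors have already been extracted, so the tiny 3-D lists `photos` and `sketches` are made-up stand-ins for real ImageNet features, and the function names are hypothetical. It compares the average L2 distance inside one dataset with the average L2 distance across the two datasets.

```python
import math
from itertools import combinations, product

def mean_distance_within(feats):
    """Average L2 distance over all distinct pairs inside one dataset."""
    pairs = list(combinations(feats, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def mean_distance_between(feats_a, feats_b):
    """Average L2 distance over every cross-dataset pair."""
    pairs = list(product(feats_a, feats_b))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Made-up 3-D vectors for illustration only (assumption);
# real feature vectors would come from the pretrained model.
photos = [[0.9, 0.1, 0.0], [1.0, 0.2, 0.1], [0.8, 0.0, 0.1]]
sketches = [[0.1, 0.9, 0.7], [0.0, 1.0, 0.8], [0.2, 0.8, 0.9]]

within = mean_distance_within(photos)
across = mean_distance_between(photos, sketches)
print(f"within photos: {within:.3f}, photos vs sketches: {across:.3f}")
```

If the cross-dataset distance is much larger than the within-dataset distance, that hints the two sets are not from the same distribution. How large a gap counts as "different" is a judgment call that this sketch does not settle.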

Since you have more than one image in each dataset, you could take the average of the feature vectors from dataset-1 and compare it to the average from dataset-2. But it might be smarter to first find N clusters in each dataset, and then compare the distances between the cluster centers.
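
The averaging idea can be sketched roughly like this. Again the feature vectors are assumed to be already extracted, the toy values are invented for illustration, and `centroid` and `centroid_distance` are hypothetical helper names, not part of any library.

```python
import math

def centroid(feats):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(feats)
    return [sum(column) / n for column in zip(*feats)]

def centroid_distance(feats_a, feats_b):
    """L2 distance between the mean feature vectors of two datasets."""
    return math.dist(centroid(feats_a), centroid(feats_b))

# Made-up 3-D vectors for illustration only (assumption).
dataset_1 = [[0.9, 0.1, 0.0], [1.0, 0.2, 0.1], [0.8, 0.0, 0.1]]
dataset_2 = [[0.1, 0.9, 0.7], [0.0, 1.0, 0.8], [0.2, 0.8, 0.9]]

print(f"centroid distance: {centroid_distance(dataset_1, dataset_2):.3f}")
```

A single mean throws away how spread out each dataset is, which is one reason the clustering variant may work better: you would compute one centroid per cluster and compare those instead.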


Thanks a lot Matthijs,
This will be very useful for me. I can run many rounds of this idea and get a much purer dataset after each round.
Also, if I use myself as a pre-trained NN, or an ImageNet model as a pre-trained ANN, it will defeat/hack my purpose of pure automation. I want to be very dumb and use only math everywhere.

A pretrained NN is only math.