I built a dataset curator to help find and remove both duplicate images and out-of-distribution images. It compares images using the intermediate representations from a pretrained VGG network (similar to how content loss is computed in style transfer).
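As a rough illustration of the idea, here is a minimal, dependency-free sketch of how feature vectors could be compared once they have been extracted. It is not the curator's actual code. The toy `features` list stands in for flattened VGG activations, and the function names, the cosine-similarity metric, the nearest-neighbour outlier rule, and both thresholds are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(features, threshold=0.98):
    """Return index pairs whose similarity is at or above the threshold.

    The threshold is a hypothetical value; it would need tuning per dataset.
    """
    pairs = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if cosine_similarity(features[i], features[j]) >= threshold:
                pairs.append((i, j))
    return pairs


def find_outliers(features, threshold=0.5):
    """Return indices of items whose closest neighbour is still dissimilar.

    Assumption: an image is treated as out of distribution when its best
    similarity to any other image falls below the threshold.
    """
    outliers = []
    for i, f in enumerate(features):
        best = max(
            (cosine_similarity(f, g) for j, g in enumerate(features) if j != i),
            default=1.0,
        )
        if best < threshold:
            outliers.append(i)
    return outliers


# Toy stand-ins for flattened VGG activations (one vector per image).
features = [
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],  # nearly identical to image 0
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],    # unlike every other image
]

print(find_duplicates(features))  # [(0, 1)]
print(find_outliers(features))    # [3]
```

The pairwise loop is quadratic in the number of images, so on a large dataset a vectorized or approximate nearest-neighbour search would probably be needed instead.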