I trained a model to classify images of salt and sugar. I downloaded 200 pictures of each class (as clean as I could get them), trained for 5 epochs on top of a resnet34 model, and it shows an impressive 10% error.
Now I want to give it a tricky image to predict on. I will have to take it from Google Images, of course, but how do I make sure that my model hasn’t seen it before? I don’t know exactly which 200 pictures it trained on.
One way I can think of is to take an image from the validation set, because the model wasn’t trained on it. So how do I find out exactly which images were in the validation set?
Or is none of this important, and the model doesn’t actually remember the images it was trained on?
The model certainly “remembers” the images in the training set. The whole point of creating a validation set is to test the model on images that it hasn’t seen (and so cannot remember) yet.
So if your goal is to test your model, you should not take an image from the training set. The idea of using an image from the validation set would work.
Another option is to take an image from Google Images and check whether its URL is in the URL file you created at the start.
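A minimal sketch of that URL check, assuming the file holds one URL per line. The file name `urls_salt.txt` and the helper names are made up for illustration, so adjust them to whatever you actually saved:

```python
def load_urls(lines):
    """Collect the non-empty, stripped lines of a URL list into a set."""
    return {line.strip() for line in lines if line.strip()}


def url_was_downloaded(url, known_urls):
    """True if this exact URL string appears in the download list."""
    return url.strip() in known_urls


# Usage with a real file (hypothetical name):
#   with open("urls_salt.txt") as f:
#       known = load_urls(f)
#   print(url_was_downloaded("https://example.com/salt.jpg", known))
```

This only compares URL strings, so the same picture hosted at a different address would not be caught.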
EDIT: You can access the validation set from the ImageDataBunch:
img, lbl = next(iter(data.valid_dl))  # img has shape [batch_size, channels, height, width]
img0 = img[0]  # extract a single image: [channels, height, width]
To add to what oneironaut said, you can also use a test set.
When you divide your dataset into a training set and a validation set, you train the model on the training set and then adjust the hyperparameters (like the learning rate, the regularization parameters…) based on the validation set. In that case one could argue that the model also indirectly used the validation set, through your adjustment of the hyperparameters. So you could also use a third slice of your original dataset, the test set, that you never use to change anything about your model. You train on the training set, you adjust hyperparameters using results from the validation set, and then you use the test set to test the model on totally new images.
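Here is a rough sketch of such a three-way split in plain Python, not fastai's own splitting code. The function name, the fractions, the seed, and the placeholder file names are all assumptions for illustration:

```python
import random


def three_way_split(items, valid_frac=0.2, test_frac=0.1, seed=42):
    """Shuffle items reproducibly and cut them into train / valid / test lists."""
    items = sorted(items)              # fixed starting order, independent of input order
    random.Random(seed).shuffle(items)  # same seed -> same split every time
    n_test = int(len(items) * test_frac)
    n_valid = int(len(items) * valid_frac)
    test = items[:n_test]
    valid = items[n_test:n_test + n_valid]
    train = items[n_test + n_valid:]
    return train, valid, test


# Placeholder file names standing in for the 400 downloaded images.
files = [f"img_{i:03d}.jpg" for i in range(400)]
train, valid, test = three_way_split(files)
print(len(train), len(valid), len(test))
```

Because the seed is fixed, you can rerun this later and recover exactly which files ended up in each slice, which also answers the "which images did it train on" question.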
Thank you for your answer!
The option to check URLs sounds good, but it’s not 100% reliable because the URL of an image might have changed, right? Or I might pick an image that is in the dataset but under a different URL.
The option to access the validation set would be perfect if I could actually see the images.
My goal here is to choose a hard-to-classify image by looking at it myself first, then check that the model didn’t train on it, and then test my model on it to see the prediction score.
In that case you can just look at the worst-classified images from the validation set, right? That’s built into fastai:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()  # shows how many misclassifications you have
interp.plot_top_losses(6)  # pick a number that matches the misclassification count above
This way you get to look at all the pictures that your classifier had trouble with.
It would be perfect if I could actually plot the lowest losses, so I could find images that the model classifies well but that are hard for humans to classify. Unfortunately, I can’t see that option.
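One way around that, sketched in plain Python: if you can get the per-image validation losses out as a list, sorting them in ascending order gives the images the model is most confident about. The `losses` values below are made up and `lowest_losses` is a hypothetical helper. Some fastai versions may also accept `largest=False` in `top_losses` / `plot_top_losses`, but treat that as an assumption and check the docs for your version.

```python
def lowest_losses(losses, k):
    """Return (index, loss) pairs for the k smallest losses, smallest first."""
    ranked = sorted(enumerate(losses), key=lambda pair: pair[1])
    return ranked[:k]


# Made-up per-image losses for five validation images.
losses = [0.02, 1.35, 0.40, 0.001, 2.10]

for idx, loss in lowest_losses(losses, 2):
    print(f"validation image #{idx}: loss {loss}")
```

The indices refer to positions in the validation set, so you can use them to look up and display the matching images.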
Try https://www.duplicatephotocleaner.com/ to detect duplicate images.
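If you would rather check it in code, a minimal sketch is to hash the raw bytes of every training file and see whether the new image's hash is among them. This is not how that tool works (I don't know its internals), and it only catches byte-identical copies: a resized or re-encoded version of the same picture gets a different hash. The function names and the folder path in the usage comment are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def folder_fingerprints(folder):
    """Fingerprint every file found under the given folder (searched recursively)."""
    return {fingerprint(p.read_bytes()) for p in Path(folder).rglob("*") if p.is_file()}


def is_exact_copy(candidate: bytes, known: set) -> bool:
    """True if the candidate bytes match a previously fingerprinted file."""
    return fingerprint(candidate) in known


# Usage with real files (hypothetical paths):
#   known = folder_fingerprints("data/salt_sugar/train")
#   print(is_exact_copy(Path("tricky.jpg").read_bytes(), known))
```

For near-duplicates (same picture, different size or compression) you would need some kind of perceptual comparison instead of an exact hash.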