Question about transfer learning dataset appropriateness

After watching the first lesson about the dog breed classifier I started researching how other people have tackled the problem. On Kaggle, specifically, there was a contest using the Stanford Dog Breed Dataset.

I’m under the impression that the Stanford Dog Breed Dataset was actually cut out of the ImageNet dataset. Does this mean one shouldn’t use this dataset with architectures that have been built with ImageNet?

In other words, would transfer learning with something like ResNet50 on this dataset be incorrect, because the model was trained on (has already seen) these images?

Thanks in advance for the clarification.

I think in Lesson 2, Jeremy explains that ResNet50 is more like an equation than a neural net. It takes up no space. It isn’t until you train and save that you have an actual neural net that takes up space on disk. So by analogy, your question is along the lines of: y=mx+b is good for graphing lines, should I use it for graphing lines? (I hope I don’t come off as sarcastic or rude. I’m just trying to explain as best I can with my limited understanding.)

My suggestion would be to just listen through at least Lesson 3.5. After that, go through Lesson 1 in detail. It’s not a bad question at all. It just gets asked in Lesson 2.

Best of luck,

Thanks for your response. I think I wasn’t totally clear in my question. I understand that ResNet50 is simply an architecture (a unique combination of layers that works well for image classification). What I’m asking is more about transfer learning: about using the pretrained weights from something like ResNet50 that has already used the ImageNet dataset for training. In this case, I don’t think it’s proper to train/test a new head with images already used… but I’m trying to get clarification on that. Thanks!

You are worried about overfitting? Lesson 2 gets into that some. My impression is Jeremy loves transfer learning. I think his answer would be something like, “Try it, see if it works. What’s the worst that could happen?”

Sorry I couldn’t be more helpful. I reserve the right to totally miss the point and be wrong again.

I would be interested in reading what your reservations are and where they are coming from

If I understand your question well, you are saying ResNet50 has already seen dogs so why train it again for dog pictures. It already knows how to recognize them. True it does. But it does not know every category of dogs. Also, it knows a lot of other things apart from dogs. Hence we need to narrow it down to dog categories only. For this, we take a ResNet50 as our backbone and add some layers to it. We then train these new layers mostly, not changing the ResNet weights much. It’s sort of like narrowing down of knowledge. You know deep learning, but you want to study just image classification. I’d suggest you watch more videos and your doubts will be solved. Alternately, you can read my article on transfer learning


… I think the original question is about data leakage.

You’re splitting the dogs dataset into train and validation. The validation images must not be seen during training, as this would ruin your chance of judging whether your model overfits or not.

When doing transfer learning, you’re using weights that are the result of training on something like ImageNet. The dog images are part of ImageNet. Therefore, it might happen that images from your validation set (now) were part of the training set that resulted in the pretrained model.
That would indeed be a problem, because the main assumption is that the model has not seen any images from the validation set, which would not be true anymore.
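One way to check whether this actually happens is to compare image IDs between your validation split and the pretraining data. Here is a minimal sketch. It assumes both datasets keep the same filename stem for the same image (if they don't, you'd have to compare file contents instead), and the helper names and file lists are made up for illustration.

```python
from pathlib import Path


def image_id(path):
    """Return the filename without directory or extension."""
    return Path(path).stem


def find_leaked(validation_paths, pretraining_ids):
    """Return the validation files whose ID also appears in the pretraining set."""
    seen = set(pretraining_ids)
    return [p for p in validation_paths if image_id(p) in seen]


# Made-up example data; real lists would come from the two datasets.
validation_paths = [
    "dogs/val/n02085620_7.jpg",
    "dogs/val/n02085620_199.jpg",
    "dogs/val/my_own_photo_001.jpg",
]
pretraining_ids = ["n02085620_7", "n02085620_500"]

leaked = find_leaked(validation_paths, pretraining_ids)
print(f"{len(leaked)} of {len(validation_paths)} validation images were seen in pretraining")
```

If the list comes back empty, the leakage concern doesn't apply to that particular split.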

Yes! 100% correct @oneironaut. I did not know there was a term data leakage.

Therefore, it might happen that images from your validation set (now) were part of the training set that resulted in the pretrained model.

That’s what I was trying to get across. I can’t train ResNet50 on ImageNet and then use the Stanford Dog Database to train a new head because the Stanford images come from the ImageNet database. Perhaps that’s why Jeremy chose to use the Oxford dog database instead. Unfortunately, both of these databases only have a small number of dog breeds. Therefore it looks like, if my goal is to make an improved dog breed classifier, I need to build my own dog breed database.

This is the clarification I was looking for. Thanks!

Thanks for the response @dipam7, but see my clarification below.

If I understand your question well, you are saying ResNet50 has already seen dogs so why train it again for dog pictures.

Negative. I’m saying ResNet50, trained on the ImageNet database, has already seen the specific Stanford Dog Database images because these images were pulled from the ImageNet database. Therefore, it would not be a valid attempt at transfer learning to use the Stanford Dog Database images, because the backbone, as you call it, would have already seen those images.

It would be impossible to judge generalization/overfitting because it would be like mixing your training and validation datasets.

I would instead say it would be better, as the class was overall ‘dog’, and not each individual species itself, like what @dipam7 says. There shouldn’t be a reason not to: this is the same as training a model to learn all of Reptilia on just the initial Phylum classes, then using those weights to further classify down the chain of species, slowly increasing your number of classes via transfer learning.

And if it’s that concerning, try it. Go gather some images of the classes from a Google search that are probably not in the dataset (usually anything new, from the past year or two, won’t be in it), and have your model evaluated on that test set.
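To make that comparison concrete, here's a rough sketch of measuring the gap between accuracy on the original validation set and accuracy on freshly gathered images. The function names and the predictions/labels are hypothetical placeholders. Keep in mind that a gap can also come from the new images simply looking different (distribution shift), so it hints at leakage rather than proving it.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def accuracy_gap(val_preds, val_labels, fresh_preds, fresh_labels):
    """Validation accuracy minus accuracy on freshly gathered images."""
    return accuracy(val_preds, val_labels) - accuracy(fresh_preds, fresh_labels)


# Made-up predictions and labels, purely for illustration.
val_labels = ["beagle", "pug", "collie", "pug"]
val_preds = ["beagle", "pug", "collie", "pug"]
fresh_labels = ["beagle", "pug", "collie", "pug"]
fresh_preds = ["beagle", "pug", "beagle", "pug"]

gap = accuracy_gap(val_preds, val_labels, fresh_preds, fresh_labels)
print(f"accuracy gap: {gap:.2f}")
```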

But I thought the ImageNet database doesn’t have an overall class ‘dog’. Instead, it has multiple breed categories.

Anyway, I’m starting to second guess myself. If I take ResNet50, trained on ImageNet, and do transfer learning with a new head, as long as I leave the original ResNet50 layers frozen, I would think it wouldn’t matter if I pass through duplicate images, since the new layers I’ve added haven’t seen these images. It’s the unfreezing part that would begin to show the model duplicate images…

But to your point, without a definitive answer here, I guess creating my own dataset of unseen images is the only way to completely ensure impartiality.


Ah you are correct there! I apologize for not digging further into that…thank you for pointing out my mistake!

And it shouldn’t, for the exact reasoning you mentioned!

Lastly, yes. Most likely. But do keep us updated on those results if you could :slight_smile:

Will do! I’m currently using Instaloader to grab images from Instagram hashtags. We’ll see how it works out.

The problem is not “duplicate images”, it’s “images in the validation set that used to be in the train set”.

Your argument that the new head has not seen those images is correct. However, if you think about the frozen part as a “feature extractor”, then the pretrained model may have learnt features from pictures that are now in your validation set.
Your new head could therefore use features that you normally wouldn’t have, because your model only learnt to extract them by training on the images that are now in your validation set.
TL;DR: even before unfreezing you could have data leakage.

Two ways to fix this:

  1. Download some samples on your own and use them as your validation set.
  2. Look up which images were in the validation set when the original model was trained, and use exactly those for your validation set.
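A variation on the second option, sketched below: if you can get the list of image IDs the pretrained model was trained on, draw your validation set only from images that are not on that list. This is just an illustration under assumptions. `leak_free_split` is a made-up helper, the file names are invented, and it again assumes filename stems match between datasets.

```python
import random
from pathlib import Path


def leak_free_split(paths, pretraining_train_ids, valid_fraction=0.2, seed=42):
    """Split paths into (train, valid) so that valid holds only unseen images."""
    seen = set(pretraining_train_ids)
    # Only images the pretrained model never trained on may go into validation.
    unseen = sorted(p for p in paths if Path(p).stem not in seen)
    n_valid = int(len(paths) * valid_fraction)
    if n_valid > len(unseen):
        raise ValueError("not enough unseen images for the requested validation size")
    rng = random.Random(seed)
    valid = set(rng.sample(unseen, n_valid))
    # Everything else, seen or unseen, is used for training the new head.
    train = [p for p in paths if p not in valid]
    return train, sorted(valid)


# Made-up example: 10 images, 6 of which were in the pretraining train set.
paths = [f"dogs/img_{i}.jpg" for i in range(10)]
pretraining_train_ids = [f"img_{i}" for i in range(6)]

train, valid = leak_free_split(paths, pretraining_train_ids)
print(len(train), "train images,", len(valid), "validation images")
```

Already-seen images can stay in the training split; what matters is that the validation images are new to every part of the model.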