During the preparation of a DataBunch, is there a reason that the fastai library splits the data into the train and valid sets before connecting images with class labels?
Wouldn't it make more sense to put a similar percentage of each class into the train and valid sets, i.e. a stratified split?
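For anyone curious, here is roughly what I mean by a stratified split. This is just a sketch using scikit-learn's `train_test_split` (the file names and labels are made up), done outside fastai before handing the split to the library:

```python
from sklearn.model_selection import train_test_split

# Hypothetical image paths and an imbalanced set of labels
paths = [f"img_{i}.jpg" for i in range(100)]
labels = ["cat"] * 90 + ["rare_bird"] * 10

# stratify=labels keeps the class proportions the same in both splits
train_paths, valid_paths, train_labels, valid_labels = train_test_split(
    paths, labels, test_size=0.2, stratify=labels, random_state=42
)

# The 20-item validation set now contains ~10% "rare_bird",
# matching the overall distribution
print(valid_labels.count("rare_bird"))
```

Without `stratify`, a random 20% split could easily end up with zero examples of the rare class in the validation set, which makes the validation metrics misleading for that class.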
Has anyone posted a blog dealing with the problems that can arise when one class has many fewer images than the others? I think Jeremy mentioned that you could make extra copies of the under-represented images, and I understand that if each copy is transformed differently, that would reduce overfitting.
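The copy-the-rare-images idea can be sketched in plain Python before building the DataBunch (this is just my own illustration, not a fastai API; since fastai applies random transforms at training time, each duplicated file gets augmented differently on each epoch):

```python
import random
from collections import Counter

def oversample(items, labels, seed=0):
    """Duplicate samples of under-represented classes until every
    class has as many items as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_items, out_labels = list(items), list(labels)
    for cls, n in counts.items():
        pool = [it for it, lb in zip(items, labels) if lb == cls]
        for _ in range(target - n):
            out_items.append(rng.choice(pool))  # re-add a random minority sample
            out_labels.append(cls)
    return out_items, out_labels

items = [f"img_{i}.jpg" for i in range(110)]
labels = ["common"] * 100 + ["rare"] * 10
items2, labels2 = oversample(items, labels)
print(Counter(labels2))  # both classes now have 100 entries
```

An alternative that avoids duplicating data is to weight the sampler instead, e.g. PyTorch's `WeightedRandomSampler`, so rare-class images are simply drawn more often.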
I wonder what the limits are. For example, if you have 100 images for most classes but only 10 for another, can a neural net do a good job of identifying new images from the sparsely populated class? This seems like a fun experiment!
Please share if you have blogged on this topic.