General course chat

When preparing a DataBunch, is there a reason the fastai library splits the data into training and validation sets before it attaches class labels to the images?

Wouldn’t it make more sense to do a stratified split, i.e. put a similar percentage of each class into the train and valid datasets?
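For anyone curious what that would look like, here is a minimal sketch of a stratified split in plain Python. This is not the fastai API, just a generic illustration: the function name and the example file names are made up.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, valid_pct=0.2, seed=42):
    """Split items so each class appears in the validation set at
    roughly the same percentage (a stratified split). Generic sketch,
    not a fastai function."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    train, valid = [], []
    for group in by_class.values():
        rng.shuffle(group)
        # take valid_pct of each class (at least one item) for validation
        n_valid = max(1, round(len(group) * valid_pct))
        valid.extend(group[:n_valid])
        train.extend(group[n_valid:])
    return train, valid

# 90 images of one class, 10 of another
items = [f"img_{i}.jpg" for i in range(100)]
labels = ["cat"] * 90 + ["dog"] * 10
train, valid = stratified_split(items, labels, valid_pct=0.2)
```

With a plain random 20% split, the 10-image class could easily end up with zero validation examples; the stratified version guarantees both classes are represented in both sets.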

Has anyone posted a blog dealing with the class-imbalance problem, i.e. when one class has many fewer images than the others? I think Jeremy mentioned that you could make extra copies of the images from the under-represented class, and I understand that if each copy is transformed differently, that helps reduce overfitting.
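The duplication idea can be sketched like this: repeat entries of the minority class in the item list until every class has as many entries as the largest one. The duplicates point at the same file, so if the training pipeline applies random augmentations, each copy gets transformed differently on every epoch. The function below is a hypothetical helper, not part of fastai.

```python
import random
from collections import Counter

def oversample(items, labels, seed=42):
    """Randomly duplicate entries of under-represented classes until
    every class matches the size of the largest class. Hypothetical
    sketch, not a fastai function."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_items, out_labels = list(items), list(labels)
    for lab, n in counts.items():
        pool = [it for it, l in zip(items, labels) if l == lab]
        # add (target - n) extra copies drawn from this class
        for _ in range(target - n):
            out_items.append(rng.choice(pool))
            out_labels.append(lab)
    return out_items, out_labels

items = [f"img_{i}.jpg" for i in range(110)]
labels = ["cat"] * 100 + ["dog"] * 10
items2, labels2 = oversample(items, labels)
```

After oversampling, each class contributes the same number of entries per epoch, so the loss is no longer dominated by the majority class.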

I wonder what the limits are. E.g. if most classes have 100 images and one has only 10, can a neural net do a good job of identifying new images from the sparsely populated class? This seems like a fun experiment! :smile: Please share if you have blogged on this topic.