Hey there,
I’m currently working on the Histopathologic Cancer Detection competition on Kaggle.
One thing that bothers me is the sheer size of the training data. It takes ages to work with 220k images.
So I’m mostly working with a sample of <= 50k images, which I get by taking a slice of the csv that provides the labels, e.g. rows 50000 to 100000.
My question is: should I create one fixed validation set that I validate all my models on, or is it enough to pick a new validation set each time, sized according to my current sample?
To give you an example: if the sample I’m working on has 10k images, the validation set consists of 2k images chosen randomly. If I then increase the sample size to 20k, I get a new validation set of 4k images, so I’m not validating the models on the same images. Would it be a better idea to build one validation set of 44k images (20% of the whole 220k) and validate my model on it EVERY TIME, even if I train it on a subsample of the original 220k images?
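To make the second option concrete, here is a rough sketch of what I have in mind. It is only an illustration under my own assumptions: the ids are plain integers standing in for the rows of the labels csv, the seed is arbitrary, `train_sample` is a made-up helper name, and I assume the training subsamples are drawn only from the images that are not in the validation set.

```python
import random

# Hypothetical setup: 220k integer ids standing in for the rows of the labels csv.
N_TOTAL = 220_000
VALID_FRAC = 0.2
SEED = 42  # arbitrary, but fixed so the split is identical on every run

all_ids = list(range(N_TOTAL))
random.Random(SEED).shuffle(all_ids)

# Carve out the validation set once; it never changes afterwards.
n_valid = round(N_TOTAL * VALID_FRAC)
valid_ids = all_ids[:n_valid]
train_pool = all_ids[n_valid:]


def train_sample(n):
    """Return the first n ids of the (already shuffled) training pool.

    Since the pool is shuffled only once, a larger sample always
    contains every smaller one.
    """
    if n > len(train_pool):
        raise ValueError("sample is larger than the training pool")
    return train_pool[:n]


small = train_sample(10_000)
larger = train_sample(20_000)
```

With this, `small` and `larger` are both scored against the same `valid_ids`, so the numbers would at least be comparable between runs. If validating on all 44k images is too slow, I guess the same idea works with a smaller fixed validation set.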
And when do you decide it is time to train the model on the whole data set? And how long do you train it then, given that you have to check in from time to time to make sure the model isn’t overfitting because it has been trained too long?
What workflow do you use to see what works and what doesn’t when working with large data sets? Jeremy has said multiple times that he works with samples too, but not exactly how he works with them…