Working with subsamples of original data

Hey there,
I’m currently working on the histopathologic cancer detection competition on Kaggle.

One thing that bothers me is the sheer size of the training data. It takes ages to work with 220k images.

So I’m mostly working with a sample of <= 50k images that I obtain by taking, e.g., rows 50,000 to 100,000 of the CSV that provides the labels.

My question is: should I create one fixed validation set that I always validate my models on, or is it sufficient to choose a new validation set each time depending on my sample size?

To give you an example: if the sample I’m working on has a size of 10k images, the validation set will consist of 2k images chosen randomly. If I then increase the sample size to 20k, the validation set will be a new one consisting of 4k images, so I’m not validating the model on the same images. Is it a good idea to build a validation set of 44k images (20% of the full 220k) and validate my model on it EVERY TIME, even if I train on a subsample of the original 220k images?

And when do you decide it is time to train the model on the whole dataset? And for how long, given that you have to check in from time to time to make sure the model isn’t overfitting because you trained it too long?

What workflow do you use to see what works and what doesn’t when working with large datasets? Jeremy has said multiple times that he works with samples too, but not exactly how he works with them…


Hi Marius. I will take a shot at your many good questions. My qualifications are not much: I’m starting the third time through the course, and I’m a slow, methodical, and otherwise busy householder.

First, I think that the main purpose of using a subsample for training is to quickly get a sense of which architectures and parameters will work well. Without spending a lot of time, you can narrow down which layers and settings to try. Then switch to the full training set.

Your method of subsampling concerns me. If the training CSV is not randomized, you will be training and validating on a biased sample. There’s a DataFrame method that will choose a (pseudo)random sample and another method that will write to CSV. These will help you create a random validation set.
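For example, a minimal sketch with Pandas, using DataFrame.sample and DataFrame.to_csv (I’m assuming the label file is called train_labels.csv; adjust the name to match yours):

```python
import pandas as pd

# Load the full label file (here assumed to be named train_labels.csv).
df = pd.read_csv('train_labels.csv')

# DataFrame.sample draws a (pseudo)random subset; random_state makes it reproducible.
subsample = df.sample(n=50_000, random_state=42)

# DataFrame.to_csv writes the subsample out so you can reuse exactly the same rows later.
subsample.to_csv('train_labels_50k.csv', index=False)
```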

A smaller validation set will show more statistical variance, but it also leaves more examples available for training. The main issue, however, is not to let samples from the training set leak into the validation set. (This assumes you will be continuing training with a larger subsample.) Otherwise the model will have already learned specific samples from the validation set, and you will see an inaccurately low validation error. Therefore best practice is to separate out a validation set before training, and use all or part of it for all phases of training. AFAIK, there is not yet an easy way to save and use a fixed validation set directly in fastai, but you could do it manually with Pandas, CSV files, and folders.
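Here is a rough sketch of that manual approach, again with Pandas (the file names are just placeholders I made up; swap in whatever you actually use):

```python
import pandas as pd

# Load the competition's label file (assumed name).
df = pd.read_csv('train_labels.csv')

# Hold out a fixed 20% validation set once, up front, and save it.
valid_df = df.sample(frac=0.2, random_state=42)
valid_df.to_csv('valid_labels.csv', index=False)

# Everything else is the pool you draw training subsamples from,
# so validation images can never leak into training.
train_pool = df.drop(valid_df.index)

# Draw a 10k (or 20k, 50k, ...) training subsample only from that pool.
train_10k = train_pool.sample(n=10_000, random_state=0)
train_10k.to_csv('train_labels_10k.csv', index=False)
```

Because the validation rows are fixed in valid_labels.csv, every subsample you train on, from 10k up to the full pool, gets scored against exactly the same images.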

Validating after every epoch is for your own assessment of training. It does not affect the training itself, but it lets you detect overfitting and stop training when the validation error stops decreasing while the training error keeps decreasing. If you want to validate less often (and likely get a lower error), you’re better off training one epoch on the full training set than seven epochs on a subsample.
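To illustrate what that check looks like in practice, here is a hypothetical early-stopping loop; train_one_epoch and validation_loss are stand-ins for whatever training and evaluation calls you actually use, not fastai functions:

```python
# Hypothetical sketch: stop when validation loss has not improved for `patience` epochs.
def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            max_epochs=50, patience=3):
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)              # your training step for one epoch
        val_loss = validation_loss(model)   # error on the fixed validation set
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f'Stopping at epoch {epoch}: validation loss stopped improving.')
            break
    return model
```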

Once you have finished training (using the Validation set as feedback), many practitioners will train a few epochs with the full training set. This last step sometimes boosts accuracy a bit.

As for your other questions - how long, what works - I think there’s no general rule. You discover by practice and observation. Maybe others will chime in with their own experience and advice.

One last observation… I have run this Kaggle dataset locally. Using a GTX 1070, one epoch on the full training set takes 7 to 23 minutes, depending on the complexity of the model. You might want to look at why yours is taking ages, and perhaps use a simpler model.

Hope this helps a bit.
