Adjusting Training and Validation Set Size for Rapid Prototyping

I am working on the Histopathology challenge using v0.7 of the library, and I am trying to figure out how the number underlined in red is calculated. My approach has been informed by several excellent kernels by students, including [this one].

My reason for asking is that I want to follow @radek’s advice (outlined on Twitter) about getting into flow when working on deep-learning Kaggle competitions. Assuming I’ve understood his approach correctly, he seems to advocate creating subsets of the training and validation sets such that the time to train one epoch is less than 10 seconds.

The purpose of such an approach seems to be that you can rapidly prototype ideas for fine-tuning your model. Once you have an approach that works well, you can return to your original datasets.

I’ve adjusted my training set by creating an alternate training-label CSV file containing a subset of the data. I’ve adjusted my validation set by creating a new directory and filling it with a subset of the validation images using a bash script.
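For anyone wanting to try the same thing, here is a minimal sketch of both subsetting steps in Python rather than bash. All file names and paths here are made up for illustration; substitute your own:

```python
import csv
import random
import shutil
from pathlib import Path


def make_subset_csv(labels_csv, out_csv, n, seed=42):
    """Write a random n-row subset of a labels CSV, keeping the header."""
    with open(labels_csv, newline="") as f:
        reader = csv.reader(f)
        header, rows = next(reader), list(reader)
    random.Random(seed).shuffle(rows)
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows[:n])


def make_subset_dir(src_dir, dst_dir, n, seed=42):
    """Copy a random sample of n files (e.g. validation images) to a new directory."""
    files = sorted(Path(src_dir).iterdir())
    sample = random.Random(seed).sample(files, min(n, len(files)))
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for f in sample:
        shutil.copy(f, Path(dst_dir) / f.name)
```

Fixing the seed keeps the subsets reproducible across runs, so timings and results stay comparable while you prototype.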

Regardless of the sizes of my subsets, the 2751 number remains the same every time I run the code. My hunch is that this comes down to a debugging issue; however, I am at a loss, since I do not know how the number in question is calculated in the first place.

It is the number of batches. You must have selected some batch_size, so 2751 = training_images / batch_size (rounded up to include the last partial batch). It will remain constant while training, but will be different for the training set and validation set.
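As a quick sketch of that arithmetic (the 176,020-image figure below is purely illustrative; it is just one dataset size that yields 2751 at batch size 64):

```python
import math


def num_batches(n_images: int, batch_size: int, drop_last: bool = False) -> int:
    # The count shown in the progress bar: one step per batch,
    # with the last partial batch kept unless drop_last is True.
    if drop_last:
        return n_images // batch_size
    return math.ceil(n_images / batch_size)


# Illustrative: 176,020 training images at batch_size 64 -> 2751 steps
print(num_batches(176_020, 64))  # -> 2751
```

So if the subset were actually being used, shrinking the training set (or raising the batch size) would shrink this number proportionally.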


This number should change if you are training on a subset of the data with the same batch size… mhmmm.

On the other hand, the 10-second rule does not seem to work that well for deep learning. What seems to work best for me nowadays is training on a smaller input size and for just a couple of epochs. I also try to separate the phase where I am working on dataset / architecture changes (in this phase I run minimal training; it’s the time I spend in front of the computer, hopefully doing something useful) from the phase where I run the training and try things out (when I am away from the computer, at night, or doing something else). This sometimes works better, sometimes not so much.
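To make the "smaller subset, smaller input size" idea concrete, here is a sketch in plain PyTorch (not the library version used above); the random tensors stand in for a real image dataset, and the sizes are arbitrary:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset, TensorDataset

# Hypothetical stand-in for a real dataset: 1,000 fake 96x96 RGB images
full = TensorDataset(torch.randn(1_000, 3, 96, 96),
                     torch.randint(0, 2, (1_000,)))

# Prototype on a random 200-example subset...
idx = torch.randperm(len(full))[:200]
loader = DataLoader(Subset(full, idx), batch_size=64, shuffle=True)

# ...and at a smaller input size: downsample each batch before the forward pass
for xb, yb in loader:
    xb_small = F.interpolate(xb, size=48)  # 96x96 -> 48x48
    # model(xb_small) ...
```

Both knobs (fewer examples, smaller inputs) cut the per-epoch time multiplicatively, which is usually enough to make the feedback loop feel fast without a strict 10-second target.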

I think the trick is to find out what works best for you given your habits, the compute resources you have (this will probably be different for someone with multiple GPUs; I mostly work on a single-GPU rig), and the goals you are after. Meaning, my approach is still evolving, and unfortunately I am not sure there is a set of rules one can follow, but certainly paying attention to how we spend our time when working on a project, and taking inspiration from how Jeremy goes about doing things, helps a lot.