Saving/restoring the training and validation sets, or the databunch

Pomo · January 18, 2019, 1:09am

I would like to make an initial split of the source training data into training and validation, for example,
data = ImageDataBunch.from_csv(csv_labels='train_labels.csv', suffix='.tif', path=DATA, folder='train', test='test', ds_tfms=None, bs=BATCH_SIZE, size=96).normalize(imagenet_stats)

I would then like to save the DataBunch (in effect, its particular training/validation split) and restore it later into a fresh DataBunch.

The reasons are 1) to continue training without mixing training data into validation by doing a second, different split; and 2) to compare different models using the same training data.

I imagine this involves correctly saving/restoring the list of filenames and labels that were chosen during the initial DataBunch creation.

Because I’m overwhelmed by the fastai internal details (sorry!), I’d appreciate seeing the exact code that accomplishes this task. Thanks so much for your help.

sgugger · January 18, 2019, 1:27am

There is no command that does that yet (it’s among the features we will implement next, though). In the meantime you have two workarounds:

1) Set the numpy seed just before creating your DataBunch: np.random.seed(some_value). This will ensure you always get the same random split (and the same validation set).
2) Use the data block API and either pass validation indexes you choose or save the first random indexes (more advanced).
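
As a rough illustration of the second workaround, here is a minimal sketch of choosing validation indexes once, saving them to disk, and reloading them later. It is not fastai code: the helper names (make_split, save_valid_idx, load_valid_idx) and the JSON file are hypothetical, and it uses Python's standard random module with a fixed seed instead of np.random.seed so that it runs without any dependencies. The indexes it produces are therefore not the ones fastai's own random split would pick.

```python
import json
import os
import random
import tempfile


def make_split(n_items, valid_pct=0.2, seed=42):
    """Return (train_idx, valid_idx) for a reproducible random split.

    The same (n_items, valid_pct, seed) always yields the same split.
    """
    rng = random.Random(seed)       # local generator, global state untouched
    idx = list(range(n_items))
    rng.shuffle(idx)
    cut = int(valid_pct * n_items)  # number of validation items
    return sorted(idx[cut:]), sorted(idx[:cut])


def save_valid_idx(valid_idx, path):
    """Write the validation indexes to a JSON file."""
    with open(path, "w") as f:
        json.dump(list(valid_idx), f)


def load_valid_idx(path):
    """Read the validation indexes back from a JSON file."""
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    # Hypothetical dataset size; in practice, the number of rows in the labels CSV.
    train_idx, valid_idx = make_split(n_items=1000, valid_pct=0.2, seed=42)
    path = os.path.join(tempfile.mkdtemp(), "valid_idx.json")
    save_valid_idx(valid_idx, path)
    restored = load_valid_idx(path)
    print(len(train_idx), len(restored), restored == valid_idx)
```

The restored list can then be handed to whichever data block splitting step accepts explicit validation indexes, so every later run and every model being compared sees the same validation set. The indexes only stay meaningful as long as the order of items in the labels file does not change between runs.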

Usmabhatt · January 27, 2021, 5:13am

Do the training data and validation data have the same distribution as the original data?