Locking data to test and validation set!

Bjorn · October 7, 2019, 6:38am

Hi

I have trained a neural network, and when I look at the validation data (top losses) I can find images that don't have any corresponding images in the train set. So what I want to do is dump a number of images into my train folders (similar to, but not the same as, the validation images). But when I create the new databunch I want all the old images to be split in the same way as before.

Example:
Picture: A(train set), B(validation set)

Add pictures C and D.
New DB: A(still in train set), B(still in validation set), C(randomly moved to validation set), D(randomly moved to train set).

Also, if I delete a few bad images, it should not affect how the remaining images are placed when creating the DB. For example:

A(train), B(valid), C(train), D(valid)

A(still in train), B(still in valid), C(deleted), D(still in valid).

The reason for this topic is that when I restart my notebook, I want to avoid cross-contamination, with images from the test set ending up in the validation set when I train more.

I am thinking of creating a CSV file or similar that records a known split. Does fast.ai have this type of feature?

Ex:

Filename A: Valid
Filename B: Test
Filename C: Train
etc…
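
A minimal sketch of that CSV idea, in plain Python and not tied to any fast.ai call. The helper name `update_split_manifest`, the two-column `filename,split` layout, and the `valid_pct` and `seed` parameters are all assumptions made for illustration. Files already in the CSV keep their split, new files get a random one, and deleted files are dropped without touching the rest.

```python
import csv
import random
from pathlib import Path


def update_split_manifest(image_names, manifest_path, valid_pct=0.2, seed=None):
    """Return {filename: 'train' | 'valid'}, keeping every earlier assignment.

    Hypothetical helper: the CSV layout (filename,split) is an assumption.
    """
    manifest_path = Path(manifest_path)

    # Load the assignments recorded on previous runs, if any.
    known = {}
    if manifest_path.exists():
        with manifest_path.open(newline="") as f:
            for row in csv.reader(f):
                if len(row) == 2:
                    known[row[0]] = row[1]

    rng = random.Random(seed)
    current = {}
    for name in sorted(image_names):
        if name in known:
            # Old image: stays exactly where it was.
            current[name] = known[name]
        else:
            # New image: assigned at random, once.
            current[name] = "valid" if rng.random() < valid_pct else "train"

    # Rewrite the manifest; files no longer on disk are left out.
    with manifest_path.open("w", newline="") as f:
        csv.writer(f).writerows(sorted(current.items()))
    return current
```

A third `test` label could be handled the same way. Note that in this sketch an image that is deleted and later added back counts as new and may land in a different split.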

VishnuSubramanian · October 7, 2019, 7:32am

You can check split_by_func to do it. You can store all the ids that you want to go to train / val.
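
A rough sketch of what such a function could look like, assuming the stored ids are plain file names and that the splitter calls the function once per item and sends it to validation when the result is `True`. The helper name `make_is_valid` is made up for this example. The exact splitter name and the type of item it passes depend on the installed fastai version (in fastai v1 it may be `split_by_valid_func`), so treat that part as an assumption and check the docs.

```python
from pathlib import Path


def make_is_valid(valid_ids):
    """Build a predicate that returns True for items stored as validation ids.

    Hypothetical helper: assumes ids are file names such as 'B.jpg'.
    """
    valid_ids = frozenset(valid_ids)

    def is_valid(item):
        # Compare by file name only, so moving the image folder
        # does not change which split an image belongs to.
        return Path(str(item)).name in valid_ids

    return is_valid


# The stored ids could come from a text or CSV file saved next to the data.
is_valid = make_is_valid(["B.jpg"])
```

Because the decision depends only on the stored ids, adding or deleting other images does not move anything that is already listed.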

Bjorn · October 7, 2019, 12:32pm

Is this a function I should do a pull request for and add to the fast.ai library?

VishnuSubramanian · October 7, 2019, 3:53pm

I remember using it in the older version, so I guess it should be available. If you have not updated your version recently, try doing so. You can use this function instead of rand_by_pct.
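
If keeping a list of ids is not wanted, one alternative (a different technique from the stored-id approach, not something fast.ai is known to provide under this name) is to derive the split from a hash of the file name. The function `hashed_is_valid` and its `valid_pct` parameter are assumptions for illustration. The same name always hashes to the same bucket, so old images never move, new images land pseudo-randomly, and deletions affect nothing else.

```python
import hashlib
from pathlib import Path


def hashed_is_valid(item, valid_pct=0.2):
    """Return True if the file name hashes into the validation bucket.

    Hypothetical helper: the split depends only on the file name.
    """
    name = Path(str(item)).name
    digest = hashlib.md5(name.encode("utf-8")).digest()
    # Map the first 4 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < valid_pct
```

The trade-off is that the validation share is only approximately `valid_pct`, and changing `valid_pct` later will move some images between splits.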