Good train/dev/test split for 100k image dataset?

I have a dataset of 100k images that I aim to classify into 120 categories. Andrew Ng provides some useful advice here: Splitting into train, dev and test sets

The size of the dev and test set should be big enough for the dev and test results to be representative of the performance of the model. If the dev set has 100 examples, the dev accuracy can vary a lot depending on the chosen dev set. For bigger datasets (>1M examples), the dev and test set can have around 10,000 examples each for instance (only 1% of the total data).
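To make the quoted variance claim concrete: treating each prediction as an independent Bernoulli trial, the standard error of an accuracy estimate is sqrt(p(1-p)/n). A rough sketch (the 90% accuracy figure is just illustrative):

```python
import math

def accuracy_std_error(p, n):
    """Standard error of an accuracy estimate p measured on n examples,
    modeling each prediction as an independent Bernoulli trial."""
    return math.sqrt(p * (1 - p) / n)

# With a true accuracy around 90%:
print(round(accuracy_std_error(0.9, 100), 4))     # n=100    -> 0.03
print(round(accuracy_std_error(0.9, 10_000), 4))  # n=10,000 -> 0.003
```

So with 100 dev examples the measurement wobbles by about ±3 percentage points, while 10k examples brings that down to ±0.3 — small enough to distinguish models that differ by a point of accuracy.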

There are a couple of options I am considering:
(train / dev / test)
(A) 90k / 5k / 5k
(B) 80k / 10k / 10k

Which do you think makes more sense, and why? In general, is there a way to check whether your validation set is too small, so you can enlarge it later? (Say, if accuracy varies too much across runs of the same model.)

100k is a good number, so 10k each for validation and test should be fine. That said, understanding the data distribution helps you make a more meaningful split. Say you have patient data: you need to ensure that all the images for a particular patient land entirely in train or entirely in validation. If you distribute such data randomly, there will be leakage between the sets, and the model will look like it performs well only because it has already seen images from the same patients during training.
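For the patient example, scikit-learn's `GroupShuffleSplit` keeps every image from one group on the same side of the split. A minimal sketch (the filenames and patient IDs are made up):

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 8 images belonging to 4 hypothetical patients.
images   = [f"img_{i}.png" for i in range(8)]
patients = ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"]

# Hold out ~25% of the data, but never split a patient across sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(splitter.split(images, groups=patients))

train_patients = {patients[i] for i in train_idx}
val_patients   = {patients[i] for i in val_idx}
print(train_patients & val_patients)  # empty set: no patient leaks across the split
```

The same idea applies to any grouping key that could leak (hospital, camera, recording session): split on the group, not on the individual image.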

Some domain problems also have class imbalance, so in those cases you have to explore smarter ways to distribute the data, such as stratified splitting.
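For class imbalance, a stratified split preserves the per-class proportions in each set. A sketch with scikit-learn (the toy labels are made up):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 examples of class "a", 10 of class "b".
data   = list(range(100))
labels = ["a"] * 90 + ["b"] * 10

# stratify=labels keeps the 90/10 ratio in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    data, labels, test_size=0.2, stratify=labels, random_state=0
)
print(Counter(y_val))  # 18 of "a", 2 of "b" -- same 90/10 ratio as the full set
```

Without `stratify`, a purely random 20-example validation set could easily end up with zero or one example of the rare class, making its accuracy on that class meaningless.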
