Why is the dogs/cats dataset split as it is?


(Tom Hale) #1

In dogscats.zip, I see the following count of files:

 11500 data/dogscats/train/dogs
 11500 data/dogscats/train/cats
  1000 data/dogscats/valid/dogs
  1000 data/dogscats/valid/cats
 12500 data/dogscats/test1

In Andrew Ng’s Machine Learning course, he says to use a 60/20/20% train/valid/test set split.

Is there any reason why the validation set (2000 examples) is smaller than the test set (12500 examples)?


(Yijin) #2

The data is from this Kaggle competition: https://www.kaggle.com/c/dogs-vs-cats/data

The test set for the competition has 12500 images. As far as I know there is no real “reason” for that, except that it is how the competition was set up.

The labelled data has 12500 images each for dogs and cats, and you can split it into training and validation sets however you want, e.g. following the standard 80%/20% rule.
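
If you want to redo the split yourself, here is a rough sketch of an 80%/20% split per class. The function name `split_train_valid`, the seed, and the file names are just placeholders I made up for illustration; in practice you would pass in the real list of labelled image paths for each class.

```python
import random


def split_train_valid(items, valid_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, valid) lists."""
    if not 0.0 < valid_fraction < 1.0:
        raise ValueError("valid_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_valid = int(round(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


# Placeholder file names standing in for the 12500 labelled images per class.
dogs = [f"dog.{i}.jpg" for i in range(12500)]
cats = [f"cat.{i}.jpg" for i in range(12500)]

# Splitting each class separately keeps both sets balanced between dogs and cats.
train_dogs, valid_dogs = split_train_valid(dogs)
train_cats, valid_cats = split_train_valid(cats)

print(len(train_dogs), len(valid_dogs))  # 10000 2500
```

With 12500 images per class this gives 10000 training and 2500 validation images for each of dogs and cats. Changing `valid_fraction` to 0.08 would give the 11500/1000 per-class split that ships in dogscats.zip.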