I think it has a lot to do with making the validation set as close to the test set as possible. You can study how the test set is organized and then write a script that moves roughly 20% of the train set into validation.
This notebook is really helpful. Thanks @karthik_k314. Quick question: this chooses 5 random drivers and moves all of their data to the validation set. Is there any benefit to this strategy over just randomly choosing 20% of the images from the entire corpus of training images to move over?
The thing is that no driver should appear in both the validation set and the training set. Making sure the same image isn't in both sets is not sufficient.
The idea behind this is that if the same driver (with a different distraction) appears in both sets, it will be “easier” for the trained model to predict for that driver in validation, even when the distraction differs. If you completely separate the drivers between the two sets, you ensure the trained model can correctly predict for a driver it has never seen before.
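A driver-level split like this can be sketched with scikit-learn's `GroupShuffleSplit`, which holds out whole groups rather than individual rows. The column names below (`subject`, `classname`, `img`) follow the competition's `driver_imgs_list.csv`, but the toy rows themselves are made up for illustration:

```python
# Minimal sketch of a driver-level train/validation split.
# Assumes a DataFrame shaped like driver_imgs_list.csv; the rows are fake.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "subject":   ["p002", "p002", "p012", "p012", "p014", "p014", "p015", "p015"],
    "classname": ["c0", "c1", "c0", "c2", "c1", "c3", "c0", "c4"],
    "img":       [f"img_{i}.jpg" for i in range(8)],
})

# Hold out ~25% of *drivers* (not images): every image belonging to a
# held-out driver goes to validation, so no driver appears in both sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(splitter.split(df, groups=df["subject"]))

train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]
assert set(train_df["subject"]).isdisjoint(val_df["subject"])
```

With the real dataset you would read the CSV instead of building the toy frame, and `test_size` controls the fraction of drivers (not images) that end up in validation.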
We’re supposed to classify an image into one of 10 states (including one for safe driving). I would guess that training on some images of a driver while validating on other images of the same driver might let the model latch onto accidental regularities (e.g. the driver’s hair color) and still score well, while ignoring the features we really care about. However, if there were enough images of the driver (at least one per class in the training set), this might not be an issue. I’d love your thoughts.
Practically speaking, I suppose you wouldn’t expect to see the same driver at training and test time.