Lesson 6: PointBlock example and Not Using Random Splitter?

Hi everyone,

In the PointBlock example in Lesson 6, I have a hard time understanding why a random splitter should not be used. Is it a general rule that all the images of one person should be used for training before moving on to the next person? Why is random splitting a bad thing in this case?


If the data is split randomly, you will likely end up with about 80% of each person’s images in the training set. This is data ‘leakage’: the model can ‘cheat’ on validation because it has already seen images (~80% of them) of every person, including the people who appear in the validation set. The validation score will then look better than it should, and the trained model will not be generalised enough to handle inference on a truly blind test image (of a person it has never seen before).
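
To make the leakage concrete, here is a minimal sketch in plain Python (it does not use fastai's API, and the dataset is made up: 10 hypothetical people with 20 images each). It compares a random 80/20 split with a split that holds out whole people, and counts how many people end up on both sides.

```python
import random

# Hypothetical dataset: 10 people, 20 images each, stored as (person_id, image_index).
items = [(person, img) for person in range(10) for img in range(20)]

def random_split(items, valid_pct=0.2, seed=42):
    """Shuffle all images and hold out valid_pct of them, ignoring who is in them."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * valid_pct)
    return shuffled[cut:], shuffled[:cut]  # train, valid

def person_split(items, valid_people):
    """Hold out every image of the chosen people, so they never appear in training."""
    train = [it for it in items if it[0] not in valid_people]
    valid = [it for it in items if it[0] in valid_people]
    return train, valid

def people_in(split):
    """Set of person ids that appear in a split."""
    return {person for person, _ in split}

rand_train, rand_valid = random_split(items)
grp_train, grp_valid = person_split(items, valid_people={8, 9})  # arbitrary choice

# People who appear in both training and validation are the source of leakage.
leak_random = people_in(rand_train) & people_in(rand_valid)
leak_grouped = people_in(grp_train) & people_in(grp_valid)

print("people shared by train/valid, random split:", len(leak_random))
print("people shared by train/valid, person split:", len(leak_grouped))
```

With the random split, the people in the validation set also show up in training, so the validation score mostly measures how well the model recognises faces it already knows. With the person-based split, the overlap is empty and validation behaves like a real test on unseen people.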

Rachel’s article here explains it very well. Thanks.