Suggested reading for everyone: Splitting your data, a blog post on the Distracted Driver Competition

I have noticed that a lot of people are having a hard time splitting their data into train/valid/test sets. In particular, many are putting pictures of the same people, objects, or animals in all three splits. When that happens, the model can score well on validation just by recognizing the individual subjects it saw during training, so the validation score overstates how well it will generalize to new ones. Jeremy has talked about this a few times, but I feel that it has been missed.

I first learned how to do this while working on the State Farm Distracted Driver Competition. I did not write a post about it at the time, so I suggest reading this one I found on Medium: https://towardsdatascience.com/how-i-tackled-my-first-kaggle-challenge-using-deep-learning-part-1-b0da29e1351b

It is a short read, but it should give you some insight into how to split your train/valid/test sets. Getting this right is essential if you do not want to be overconfident about your model's performance.
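
For anyone who wants to see the idea in code, here is a minimal sketch of a group-aware split. It is not taken from the linked post. The `split_by_group` helper, the driver IDs, and the file names are all made up for illustration; the only assumption is that each image can be tagged with the person (or object, or animal) it shows.

```python
import random
from collections import defaultdict


def split_by_group(samples, valid_frac=0.2, seed=0):
    """Split (group_id, item) pairs so that no group lands in both sets.

    Hypothetical helper: `samples` is a list of (group_id, item) tuples,
    e.g. (driver_id, image_filename).
    """
    # Collect all items belonging to each group.
    by_group = defaultdict(list)
    for group_id, item in samples:
        by_group[group_id].append(item)

    # Shuffle the groups (not the individual images) reproducibly.
    groups = sorted(by_group)
    random.Random(seed).shuffle(groups)

    # Hold out a fraction of the groups, at least one.
    n_valid = max(1, round(len(groups) * valid_frac))
    valid_groups = set(groups[:n_valid])

    train = [(g, x) for g in groups if g not in valid_groups for x in by_group[g]]
    valid = [(g, x) for g in groups if g in valid_groups for x in by_group[g]]
    return train, valid


# Made-up data: 10 drivers with 5 images each.
samples = [
    (f"driver_{d:02d}", f"img_{d:02d}_{i}.jpg")
    for d in range(10)
    for i in range(5)
]

train, valid = split_by_group(samples, valid_frac=0.2, seed=42)

train_drivers = {g for g, _ in train}
valid_drivers = {g for g, _ in valid}

# No driver appears on both sides of the split.
assert train_drivers.isdisjoint(valid_drivers)
```

The key point is that the shuffle happens over drivers, not over images, so every image of a held-out driver goes to validation. If you already use scikit-learn, `GroupShuffleSplit` and `GroupKFold` in `sklearn.model_selection` do the same kind of split for you.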
