A subtle mistake in Chapter 7

This concerns the get_dls function from the book. It uses random splitting instead of GrandparentSplitter. (In itself this might be OK; you get an 80/20 split instead of a 70/30 split.) The main issue arises when you use this function for progressive resizing.
Let me explain: I only noticed something was off when I started experimenting in a Kaggle notebook with Imagenette and got a suspiciously high 98 percent accuracy. The issue is a subtle data leakage. When you resize and call get_dls again, you get a different random split, so the new validation set is different from the old one. This means that the model you are now trying to improve has already seen most of the new validation set during training (about 80 percent of it on average, given an 80/20 split).
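To make the leak concrete, here is a minimal sketch using only the Python standard library. It does not use fastai at all: `random_split` is a hypothetical stand-in for an unseeded random splitter, and the dataset is just a range of indices. It splits the same items twice and measures how much of the second validation set was in the first training set.

```python
import random

def random_split(n_items, valid_pct=0.2, rng=None):
    """Hypothetical stand-in for an unseeded random splitter.

    Returns (train_indices, valid_indices) as sets.
    """
    if rng is None:
        rng = random.Random()  # unseeded: a different shuffle on every call
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)
    return set(idxs[cut:]), set(idxs[:cut])

n = 10_000
train_1, valid_1 = random_split(n)  # split used at the small image size
train_2, valid_2 = random_split(n)  # new split after building the loaders again

# Items the model trained on in stage 1 that are now used for validation
leaked = valid_2 & train_1
leak_fraction = len(leaked) / len(valid_2)
print(f"{leak_fraction:.0%} of the new validation set was in the old training set")
```

With an 80/20 split the printed fraction should come out at roughly 80 percent, which would explain an inflated validation accuracy after resizing.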
Of course, this is easy to fix: just use GrandparentSplitter, or pass a fixed seed to the random splitter, so that every call to get_dls produces the same validation set. I hope this helps. Zoltan
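For reference, a small standard-library sketch of why both fixes work. The helper names and the file paths below are made up for illustration. In fastai itself I believe the equivalents would be `RandomSplitter(seed=...)` or `GrandparentSplitter(valid_name='val')` passed as the `splitter` of the DataBlock, but treat that as an assumption to check against the docs rather than tested code.

```python
import random
from pathlib import PurePosixPath

def seeded_split(n_items, valid_pct=0.2, seed=42):
    """Random split with a fixed seed: the same split on every call."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)
    return set(idxs[cut:]), set(idxs[:cut])

def grandparent_split(paths, train_name="train", valid_name="val"):
    """Split by the name of each file's grandparent folder (deterministic)."""
    train = [p for p in paths if PurePosixPath(p).parent.parent.name == train_name]
    valid = [p for p in paths if PurePosixPath(p).parent.parent.name == valid_name]
    return train, valid

# Fix 1: a fixed seed gives the same validation set before and after resizing
_, valid_a = seeded_split(10_000)
_, valid_b = seeded_split(10_000)
same_seeded_split = valid_a == valid_b

# Fix 2: the folder layout decides the split (hypothetical example paths)
paths = [
    "imagenette2/train/n01440764/a.JPEG",
    "imagenette2/val/n01440764/b.JPEG",
    "imagenette2/train/n02102040/c.JPEG",
]
train_files, valid_files = grandparent_split(paths)
print(same_seeded_split, len(train_files), len(valid_files))
```

Either way the validation set no longer depends on how many times the loaders are rebuilt, so the model never trains on images it is later validated on.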

Hey @muellerzr, hope you don't mind the mention.

Can you confirm this?