Validation set instability - [possibly solved]

ikedim · July 10, 2019, 5:11pm

In lecture 2, Jeremy said it was desirable to keep the validation set the same
between runs - and this was the reason for the np.random.seed(2) call before
creating the DataBunch. However, in working on the lesson 1 notebook in Kaggle
and in Floydhub I noticed an apparent problem - the validation set did stay
the same when re-created with the same seed within the same session,
but it changed between kernel runs, even with the same random seed and even
when passing num_workers=0 to ImageDataBunch.from_name_re.

After some poking around, I think I found the problem - the fnames list
returned from get_image_files seems to be in a different order in different
kernel runs. I’m not sure of the reason for this, but calling fnames.sort()
before creating the ImageDataBunch seemed to fix the problem - the validation
set now stays the same between kernel runs. I think sorting should not bias
the split, since split_by_rand_pct applies a random permutation before doing
the split.
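
To illustrate why the input order matters, here is a minimal sketch that uses only NumPy and made-up file names instead of fastai itself. The helper split_valid is hypothetical: it only mimics the general "seed, permute, take the first part" pattern and is not the library's actual implementation.

```python
import numpy as np

def split_valid(fnames, valid_pct=0.2, seed=2):
    """Hypothetical stand-in for a seeded random split (not fastai code)."""
    np.random.seed(seed)
    # The permutation depends only on the seed and the list length,
    # so it picks the same *positions* every time.
    idx = np.random.permutation(len(fnames))
    cut = int(valid_pct * len(fnames))
    return [fnames[i] for i in idx[:cut]]

# The same files listed in two different orders, as might happen
# between two kernel runs.
run_a = [f"img_{i:02d}.jpg" for i in range(20)]
run_b = list(reversed(run_a))

# Same seed, same positions, but the positions point at different files,
# so these two validation sets will generally differ.
unsorted_a = split_valid(run_a)
unsorted_b = split_valid(run_b)

# Sorting first makes the positions refer to the same files in both runs.
sorted_a = split_valid(sorted(run_a))
sorted_b = split_valid(sorted(run_b))
print(sorted_a == sorted_b)
```

In other words, the seed fixes which indices are chosen, and sorting fixes which files those indices refer to.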

Could I suggest that since a silently changing validation set could lead to
some mysterious symptoms, maybe the default behavior of split_by_rand_pct
should be to sort its input list and call np.random.seed(2) (or some other
default seed value) before doing the permutation so that you get a stable
result by default?
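
A rough sketch of what I mean, written as a standalone function rather than a change to fastai. The name stable_split_by_rand_pct, its parameters and the default seed of 2 are all assumptions for illustration; none of this is the library's API.

```python
import numpy as np

def stable_split_by_rand_pct(items, valid_pct=0.2, seed=2):
    """Hypothetical helper (not part of fastai): sort, seed, permute, split."""
    # Sort so the result does not depend on the order the items arrived in.
    items = sorted(items)
    # Passing seed=None keeps the current global random state instead.
    if seed is not None:
        np.random.seed(seed)
    idx = np.random.permutation(len(items))
    cut = int(valid_pct * len(items))
    valid = [items[i] for i in idx[:cut]]
    train = [items[i] for i in idx[cut:]]
    return train, valid

names = ["c.jpg", "a.jpg", "e.jpg", "b.jpg", "d.jpg"]
train1, valid1 = stable_split_by_rand_pct(names)
train2, valid2 = stable_split_by_rand_pct(names[::-1])
print(valid1 == valid2)
```

One caveat with this sketch is that np.random.seed resets NumPy's global random state as a side effect; using a local np.random.RandomState(seed) for the permutation would avoid that.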