Validation set not reproducible in Kaggle?

I’m trying out Lesson 1 (dog and cat breeds) using Kaggle.

I understand from lecture 2 that the point of the np.random.seed(2)
call before creating the ImageDataBunch is to ensure we get the
same validation set every time. I’m trying to check this in Kaggle
(I add a cell evaluating data after the data = ImageDataBunch … line;
this prints out the labels of the first few images in the validation set).

I find that the validation set is reproducible within the same kernel run
(every time I run data = ImageDataBunch after calling np.random.seed(2)
I get the same label list). However, after restarting the kernel I’m getting
a different label list for the validation set, even with the same seed.
Does this mean the validation set is different between runs?
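Comparing the first few printed labels by eye only covers part of the validation set. A minimal sketch of a stricter check is to hash the full list of validation file names and compare the digest before and after a kernel restart. The helper name `fingerprint` is hypothetical, and the attribute path in the usage comment is an assumption about fastai v1 objects rather than something confirmed in this thread.

```python
import hashlib

def fingerprint(names):
    """Return a SHA-256 digest of a collection of file names.

    Names are sorted first, so the digest depends only on *which* files
    are in the set, not on the order they are listed in.
    """
    joined = "\n".join(sorted(str(n) for n in names))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Hypothetical usage in the notebook (attribute path is an assumption):
# print(fingerprint(data.valid_ds.x.items))
```

If the digest printed after a restart differs from the earlier one, the validation set really does contain different files, rather than the same files shown in a different order.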

There’s more than just that one individual seed we have to set when we want reproducible results. See here: [Solved] Reproducibility: Where is the randomness coming in?
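For reference, a rough sketch of what setting "more than one seed" usually looks like. The linked thread is not reproduced here, so the exact set of calls below is an assumption; the function name `seed_everything` is hypothetical, and NumPy and PyTorch are seeded only if they are installed.

```python
import random

def seed_everything(seed=2):
    """Seed the common sources of randomness in a training notebook."""
    random.seed(seed)  # Python's built-in RNG
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's global RNG
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)           # CPU RNG
        torch.cuda.manual_seed_all(seed)  # all GPU RNGs (no-op without CUDA)
        # Ask cuDNN for deterministic kernels; this may slow training.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

seed_everything(2)
```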


Thanks, but I thought that thread was about getting reproducible training results,
which I understand is more controversial - in lecture 2 Jeremy Howard says he
actually doesn’t usually advocate that. However, he says you do want the validation
set to stay the same, and that’s why they have the np.random.seed(2) call in the notebook.

Try this: when we call split_by_rand_pct(), we can pass in a seed after our validation percentage. Looking at the source code, it should already be working the way you expect, but just in case, try passing .split_by_rand_pct(seed=2)

Thanks! I tried adding seed=2 to the data = ImageDataBunch.from_name_re() call;
according to the source code, this seed value should then be passed to split_by_rand_pct().

I’m still getting the same validation set within a session but not across kernel restarts.

To be clear I’m working from the Kaggle lesson 1 notebook forked from
https://course.fast.ai/start_kaggle.html
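One possible explanation, which is an assumption and not something confirmed in this thread: a seeded random split picks the same index positions every time, but if the file list itself comes back in a different order after a restart (directory listing order is not guaranteed by the operating system), those positions point at different files. The sketch below uses only the standard library and a hypothetical `split_valid` helper to illustrate the effect; it does not reproduce fastai's actual splitting code.

```python
import random

def split_valid(items, valid_pct=0.2, seed=2):
    """Pick a validation subset by shuffling index positions with a fixed seed."""
    rng = random.Random(seed)
    idx = list(range(len(items)))
    rng.shuffle(idx)
    cut = int(valid_pct * len(items))
    return {items[i] for i in idx[:cut]}

# Same files, listed in a different order (simulating two kernel sessions).
files_run1 = [f"img_{i}.jpg" for i in range(20)]
files_run2 = list(reversed(files_run1))

# Same seed, different listing order: the chosen files will usually differ.
unsorted_match = split_valid(files_run1) == split_valid(files_run2)

# Sorting before splitting removes the dependence on listing order.
sorted_match = split_valid(sorted(files_run1)) == split_valid(sorted(files_run2))

print("unsorted lists give same validation set:", unsorted_match)
print("sorted lists give same validation set:  ", sorted_match)
```

If this is what is happening on Kaggle, sorting the file names before building the data object would be one way to test the hypothesis.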