What does np.random.seed(2) do?

kodzaks · March 26, 2019, 3:12pm

Why do we need a random seed at all? I ran into a problem reproducing results: every time I run the model (it selects images randomly for validation) I get different results. I was told that the solution is to use a random seed. I understand that it might be good to get similar results, but, ultimately, the model will not generalize well and will do poorly with brand new images (which is exactly the case for my images). Or am I completely wrong and have I just misunderstood what a random seed does? If the model shows such a big discrepancy in results depending on which random images end up in the validation set, does that mean that even the best results are some sort of nonsense and just a random hit?

Also, in the previous version of part 1 we had a folder for validation images; in this version the validation set is drawn from the overall image set. So what is the best way to set up a validation set?

Jarmos · March 27, 2019, 10:38am

First, code reproducibility, especially in Data Science, isn’t an obligation; it is there for the sake of research and to make sure other readers are on the same page. You can read more on the relevance of reproducibility in this article, The Relevance of Reproducible Research. When you’re building for production, though, it’s advisable not to use the random seed.

Secondly, as far as I know, a core part of the fastai library is the data_block API, through which you can create a very customized DataBunch object. Let me quote a specific use case right from the docs.

In vision.data, we can create a DataBunch suitable for image classification by simply typing:

data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=64)

This is a shortcut method which is aimed at data that is in folders following an ImageNet style, with the train and valid directories, each containing one subdirectory per class, where all the labelled pictures are.

Here is the same code, but this time using the data block API, which can work with any style of a dataset.

data = (ImageList.from_folder(path)
        .split_by_folder()
        .label_from_folder()
        .add_test_folder()
        .transform(tfms, size=64)
        .databunch())

Read up more on it over here.

Briefly explained, what ImageDataBunch really does is create a DataBunch object out of the box when the dataset is already preformatted into train and valid sets. But most real-world datasets don’t come in that nice, easy-to-use format. That’s where the data_block API comes in handy: there you have to specify whether, and which part of, the dataset to use for validation.
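To make the "which part of the dataset to use for validation" step concrete, here is a minimal NumPy-only sketch of a seeded random hold-out. It is not fastai code: random_split, valid_pct and the file names are made up for illustration. The point is only that the same seed yields the same split on every run.

```python
import numpy as np

def random_split(items, valid_pct=0.2, seed=None):
    """Hold out a random fraction of `items` for validation (hypothetical helper)."""
    rng = np.random.RandomState(seed)      # local generator, seeded for repeatability
    order = rng.permutation(len(items))    # shuffled indices 0..n-1
    n_valid = int(len(items) * valid_pct)  # number of items to hold out
    valid = [items[i] for i in order[:n_valid]]
    train = [items[i] for i in order[n_valid:]]
    return train, valid

files = [f"img_{i:03d}.jpg" for i in range(10)]  # made-up file names

# Two separate "runs" with the same seed produce the same split.
train_a, valid_a = random_split(files, seed=2)
train_b, valid_b = random_split(files, seed=2)
print(valid_a == valid_b)
```

With seed=None the split would generally change from run to run, which is the situation described in the first post.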

racket99 · February 14, 2020, 12:31am

I understand the concept of seeds in random number generators, but in this case even though we provide a seed, each time we call data.show_batch, we get different images? If we have a seed, shouldn’t we always see the same batch?

Jarmos · March 3, 2020, 11:20am

Sorry for the delayed response. I’m not actively developing with the fastai library anymore, and probably won’t be until v3 goes live.

But here’s the thing. The point of using an RNG along with a seed is to produce the same outcomes for multiple users. Jeremy has already mentioned this in one of the tutorials, I believe: code reproducibility matters if you share your code with someone else. Other than that, you won’t ever have to bother about the np.random.seed(2) snippet.

in this case even though we provide seed, each time we call data.show_batch, we get different images?

But to answer your question: no, not necessarily. np.random.seed(2) is there to keep the code reproducible, not to show the same set of images every time you call data.show_batch.

Pseudo-RNGs are a rabbit hole, and if you go down it you will only keep going deeper. So it’s best to leave it at that and keep in mind that you need to use the same arbitrary number for np.random.seed across all your scripts if you want to share them with someone.
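A small sketch of what the seed does and does not do, using only NumPy's global generator: seeding fixes where the stream of random numbers starts, but each later draw still moves the stream forward, so two consecutive draws differ unless you reseed in between.

```python
import numpy as np

np.random.seed(2)
first = np.random.rand(3)    # three draws from the freshly seeded generator
second = np.random.rand(3)   # the generator has moved on, so these differ

np.random.seed(2)            # reseeding rewinds the generator to the same start
replay = np.random.rand(3)

print(np.array_equal(first, replay))   # True
print(np.array_equal(first, second))   # False
```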

dlr10 · June 12, 2020, 3:10pm

racket99, I had the same confusion at the beginning, but these are separate functionalities. The seed fixes the division of your dataset, so your training and validation sets will always be the same. data.show_batch, however, is just a visualization tool that shows you a random sample from your training set. For example, if your training set has 1000 samples, show_batch will show you a random 9 to 12 of them, but those 1000 in your training set are always the same. If you want to change your 1000 images, change the number inside the seed.
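Roughly, in NumPy terms (this is only an illustration, not how fastai implements show_batch; preview and train_ids are made-up names): the training set stays fixed, while every preview call draws a fresh handful of items from it.

```python
import numpy as np

np.random.seed(2)
train_ids = np.arange(1000)   # stand-in for a fixed training set of 1000 images

def preview(ids, n=9):
    """Pick n distinct items to display (hypothetical stand-in for a batch preview)."""
    return np.random.choice(ids, size=n, replace=False)

batch_1 = preview(train_ids)
batch_2 = preview(train_ids)

print(batch_1)   # nine ids from the training set
print(batch_2)   # nine other ids, still from the same training set
```

Both previews come from the same 1000 ids; only the nine items picked for display change between calls.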

Hope this was clear