What does np.random.seed(2) do?

Hi everyone, I am also having trouble understanding this line.

Mauro, when you say “random block of the validation set data” what exactly is the “random block”?

1 Like

ImageDataBunch creates a validation set randomly each time the code block is run. To maintain a degree of reproducibility, np.random.seed() (from NumPy) is called beforehand so that the same split is produced on every run.

What Mauro meant by “random block of the validation set data” is that each time you run your code, ImageDataBunch would choose a random chunk of data from the original dataset. Depending on your use case, this could be good or bad (you never know :man_shrugging:t4:). Jeremy talks about it briefly in lesson 2.

So in order to have some control and predictability over WHICH chunk of data ImageDataBunch uses to create the validation set, np.random.seed() is called first.
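To make the idea concrete, here is a minimal sketch of how a seed can pin down a random train/validation split. The `split_indices` helper is hypothetical, purely for illustration; it is not fastai's actual implementation.

```python
import numpy as np

def split_indices(n_items, valid_pct=0.2, seed=None):
    # Hypothetical helper: shuffle the item indices and carve off
    # a validation chunk. Seeding makes the shuffle repeatable.
    if seed is not None:
        np.random.seed(seed)
    shuffled = np.random.permutation(n_items)
    n_valid = int(n_items * valid_pct)
    return shuffled[n_valid:], shuffled[:n_valid]  # train, valid

train_a, valid_a = split_indices(100, seed=2)
train_b, valid_b = split_indices(100, seed=2)
print((valid_a == valid_b).all())  # True: same seed, same split
```

Without the `seed` argument, each call would hand back a different validation chunk, which is exactly the behaviour the seed is there to prevent.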

9 Likes

Hi thanks @jarmos - I think I understand. It sounds like there is a specific set (or block) of images that the validation set gets randomly chosen from? So whenever I call that factory method to create a new image data bunch, the seed makes sure that we always choose from this same set of images? Instead of changing each time?

4 Likes

Yeah, exactly! You got it right this time around, but I still don’t understand the significance of the number that’s passed as an argument to the seed() function. I mean, why 2 specifically and not some other arbitrary number like, say, 9000?

I asked around a couple of places but couldn’t get a proper explanation for it.

1 Like

It can be any number; what matters is that when the same number is used again, the randomization is repeatable (so the same set of images is picked by ImageDataBunch, and any other use of the randomizer repeats the same numbers).

3 Likes

A random set or block of images. The number of images inside this set or block depends on the size you assign to your validation set. If you have a total of 100 images and you assign your validation set to be 20%, then your random block will have 20 random images out of the 100 total.
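The 100-image example above can be sketched in a couple of lines. This is just an illustration of the arithmetic, not how fastai performs the split internally:

```python
import numpy as np

np.random.seed(2)                    # fix the randomness first
all_images = np.arange(100)          # stand-ins for the 100 image indices
# Draw 20 distinct indices (20% of 100) for the validation block
valid = np.random.choice(all_images, size=20, replace=False)
print(len(valid))                    # 20
```

Re-running this with the same seed picks the same 20 indices every time.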

2 Likes

Thanks everyone! I think I’m clear on most of that now, although I’m still not clear on the seed number. I understand now that not changing it keeps the block consistent between data bunch creations. But what does the actual number represent? Is it just some sort of ID? Or is it numerically significant somehow?

I did some digging around. Here’s what I found from the NumPy documentation:

Seed the generator.

This method is called when RandomState is initialized. It can be called again to re-seed the generator. For details, see RandomState.

Parameters: seed : int or 1-d array_like, optional

So I checked the documentation for RandomState as well and here’s what I found:

Compatibility Guarantee A fixed seed and a fixed series of calls to ‘RandomState’ methods using the same parameters will always produce the same results up to roundoff error except when the values were incorrect.

I know it sounds confusing (at least it did to me), but from what I understand, “a fixed seed” is necessary for RandomState to produce the same results every time. In other words, once you execute the code block with np.random.seed(2), DON’T change the seed argument to any other number if you want reproducibility.

I also found a StackOverflow post, What does numpy.random.seed(0) do?, which you can refer to if you want to understand the underlying mathematics involved, but I would strongly advise against it since it’s not relevant to the course.

Quoting from that answer, here’s what it had to say:

(pseudo-)random numbers work by starting with a number (the seed), multiplying it by a large number, then taking modulo of that product. The resulting number is then used as the seed to generate the next “random” number. When you set the seed (every time), it does the same thing every time, giving you the same numbers.
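The “multiply, then take the modulo” recipe the answer describes is a linear congruential generator. Here is a toy sketch of one; the constants are glibc's well-known `rand()` parameters, used purely for illustration (NumPy itself uses a different, more sophisticated generator, the Mersenne Twister):

```python
def lcg(seed, a=1103515245, c=12345, m=2**31):
    # Toy linear congruential generator: new = (a * prev + c) % m.
    # The seed supplies the very first "previous" value.
    state = seed
    while True:
        state = (a * state + c) % m
        yield state

gen1, gen2 = lcg(2), lcg(2)
print([next(gen1) for _ in range(3)])
print([next(gen2) for _ in range(3)])  # identical: same seed, same stream
```

Two generators started from the same seed walk through exactly the same sequence, which is all that “reproducibility” means here.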

1 Like

Note there are other randomisers in play for other libraries in use that you may want to seed to constrain results further, e.g. see Accumulating Gradients.

Thanks jarmos. I’d still like to find out what the significance of the actual value is. I understand how changing it affects the reproducibility. But what is the difference between a 2 and say a 4 for the seed, or even some other arbitrary number like 10938? That’s kind of the place I’m stuck with on my understanding of that.

The seed is the starting point for the random number generator: an initial number from which np.random.rand() computes subsequent values. Normally, when you call np.random.rand(), the pseudo-random generator gives you a different number every time.

But when you set a seed, the output of the function will always be the same.

Based on this, the dataset is divided between train and validation. You can set the seed to whatever value you want; 2 is only an example, but it must be the same value every time you run and load your dataset, to be sure that you get the same split.
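The re-seeding behaviour is easy to check directly (a small sketch of the point above):

```python
import numpy as np

np.random.seed(2)
first = np.random.rand()   # some value determined entirely by the seed
np.random.seed(2)          # re-seed with the same number...
second = np.random.rand()  # ...and the stream restarts identically
print(first == second)     # True
```

Omit the second `np.random.seed(2)` call and `second` would almost certainly differ, since the generator would simply continue its stream.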

And also something interesting: you can even set a seed for torch.
torch.manual_seed(2)

After this, you will get the same results in your neural network: the same loss, accuracy, and parameter values.

2 Likes

In short, nothing.

From https://stackoverflow.com/a/22639752, emphasis mine:

Pseudo-random number generators work by performing some operation on a value. Generally this value is the previous number generated by the generator. However, the first time you use the generator, there is no previous value.

Seeding a pseudo-random number generator gives it its first “previous” value.

2 Likes

Coincidentally, even I was confused for a while by the choice of argument for the seed() function. I gave up quickly since I realised it was taking me off track and wasn’t something I wanted to spend much time on. But regardless, here are some resources I have had bookmarked since then.

Understanding the meaning of seed in generating random values?
This StackExchange answer gives a fairly straightforward explanation, assuming the reader has at least a high-school level of mathematical understanding.

Random Seed - Wikipedia

Pseudo-random number generators - Khan Academy

If you read up on these resources, I believe it should be enough to get you on track to understand the underlying concepts of np.random.seed(). But REMEMBER, these concepts aren’t relevant to the course in any way, and the more you dive in, the more confused you’ll get, since you would then have to brush up on mathematics and computer science concepts.

I see: so you are drawing the validation set from the original data set, and that validation set is determined at random? I get it in concept, but where does the generation of the validation data set (I am guessing it is ImageDataBunch) reference the seed? It is not a parameter passed to it.

The argument passed to seed() is quite complicated to explain, and I don’t know the mathematics behind it either. All I know is that np.random.seed(2) yields a very different validation set from the one you’d get by passing np.random.seed(198) at some point.
The point is that the argument passed to seed() is arbitrary, but it is important to bear in mind never to change it if you want future reproducibility.

1 Like

Why do we need a random seed at all? I ran into an issue with reproducing the same results: every time I run the model (it selects images randomly for validation), I get different results. I was told that the solution is to use a random seed. I understand that it might be good to get similar results, but ultimately the model will not generalize well and will do poorly on brand-new images (which is exactly the case for my images). Or am I completely wrong and have just misunderstood what a random seed does? If the model has such a big discrepancy in results based on which random images end up in the validation set, does it mean that even the best results are some sort of nonsense, just a random hit?

Incidentally, in the previous version of part 1 we had a folder for validation images; in this version the validation set is drawn from the overall image set. So what is the best way to set up a validation set?

First, code reproducibility, especially in data science, isn’t an obligation but exists for the sake of research and making sure other readers are on the same page. You can read more on the relevance of reproducibility in this article, The Relevance of Reproducible Research. For a production build, though, it’s advisable not to use a fixed random seed.

Secondly, as far as I know, the core of the fastai library has a data block API through which you can create a very customized DataBunch object. Let me quote a specific use case right from the docs.

In vision.data , we can create a DataBunch suitable for image classification by simply typing:

data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=64)

This is a shortcut method which is aimed at data that is in folders following an ImageNet style, with the train and valid directories, each containing one subdirectory per class, where all the labelled pictures are.

Here is the same code, but this time using the data block API, which can work with any style of a dataset.

data = (ImageList.from_folder(path)
        .split_by_folder()
        .label_from_folder()
        .add_test_folder()
        .transform(tfms, size=64)
        .databunch())

Read up more on it over here.

Briefly: what ImageDataBunch really does is create a DataBunch object out of the box, where the dataset is already preformatted into train and valid sets. But most real-world datasets don’t come in that nice, easy-to-use format; that’s where the data block API comes in handy, since it lets you specify which part of the dataset to use for validation.

4 Likes

I understand the concept of seeds in random number generators, but in this case even though we provide a seed, each time we call data.show_batch, we get different images? If we have a seed, shouldn’t we always see the same batch?

Sorry for the delayed response, I’ve not been actively developing on the fastai library anymore, probably until v3 goes live.

But here’s the thing. The motive for using an RNG along with a seed is to make outcomes uniform across multiple users. As Jeremy mentioned in one of the tutorials, I believe, code reproducibility matters when you share your code with someone else. Other than that, you won’t have to bother about the np.random.seed(2) code snippet at all.

in this case even though we provide a seed, each time we call data.show_batch, we get different images?

But to answer your question: no, not necessarily. np.random.seed(2) is there to keep the train/validation split reproducible, not to show the same set of images every time you call data.show_batch.

Pseudo-RNGs are a rabbit hole, and if you go down it, you will only keep going deeper. So it’s best to leave it at that, and to keep in mind that you need to use the same arbitrary number for np.random.seed across all your scripts if you care to share them with someone.

racket99, I had the same confusion at the beginning, but these are separate functionalities. The seed fixes the division of your dataset, so your training and validation sets will always be the same. However, data.show_batch is just a visualization tool that shows you a random sample from your training set, i.e. if your training set has 1000 samples, show_batch will show you a handful (say, 9 or 12) of random samples from those 1000, but those 1000 in your training set are always the same. If you want to change which 1000 images end up in the training set, you should change the number inside the seed.

Hope this was clear