[Solved] Reproducibility: Where is the randomness coming in?

Pinned fastcore to 0.1.12 (0.1.13 was complaining about as_item missing).
Still not able to get rid of the randomness.

I grabbed your random function and tried it with the MNIST example from the walkthrough, and it's still random :frowning:

Using plain PyTorch in this case (not fastai, but the no less amazing https://github.com/qubvel/segmentation_models.pytorch), I was hitting the same problem in Jupyter: with all the seed / force-deterministic operations listed above, results were reproducible when re-running all cells (without a restart),

but between kernel restarts the results were always different :frowning:

Note: all results (splits, augmentation, the pre-training validation epoch) were identical up to the point where torch training starts. During training something is still non-deterministic, and the only fix I found was setting the PYTHONHASHSEED environment variable mentioned above before starting Jupyter.
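For reference, the seed / force-deterministic operations referred to above are roughly the following (a sketch only; the helper name seed_everything is mine, not necessarily the exact function from the earlier posts):

import os, random
import numpy as np
import torch

def seed_everything(seed=42):
    # Seed every RNG the data pipeline and training loop touch
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force cuDNN onto deterministic kernels (slower, but repeatable)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # This only affects child processes spawned afterwards; for the current
    # kernel, PYTHONHASHSEED must already be exported before Jupyter starts
    os.environ['PYTHONHASHSEED'] = str(seed)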

After setting that before launch, I can finally fully reproduce results between restarts!
Really tricky issue and hard to detect; probably a lot of people think they have reproducible results when they haven't (much like "valid" backups :slight_smile:).

Next step: check container restarts :), host restarts, different VMs, and different cloud providers… who knows? :slight_smile:

(Note: as mentioned by @esingildinov, PYTHONHASHSEED has to be set before the Jupyter/kernel starts; setting the env var inside a notebook doesn't work. The same thing is noted here:

)
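You can sanity-check this from inside a running kernel (the launch command in the comment is just one way to export the variable):

import os
# PYTHONHASHSEED must be in the environment before the Jupyter server starts, e.g.:
#   PYTHONHASHSEED=0 jupyter notebook
# Setting os.environ['PYTHONHASHSEED'] inside a cell is too late, because CPython
# reads the variable at interpreter startup to seed str/bytes hashing.
print(os.environ.get('PYTHONHASHSEED'))  # should print the exported value, not None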

2 Likes

Any tips on how to do this with Colab?

Here is an example of reproducibility in fastai2:

from fastai.vision.all import *
def is_cat(x): return x[0].isupper()
path = untar_data(URLs.PETS)/'images'
# First run: seed everything, then build the DataLoaders and Learner
set_seed(42, True)
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fit(1)
# Second run: re-seed and recreate the DataLoaders to reproduce the first run exactly
set_seed(42, True)
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fit(1)

Please note that you must set the seed before the DataLoaders are created, and recreate the DataLoaders when setting a new seed.

The DataLoader keeps an internal random number generator that is seeded (with a random number) at the moment the DataLoader is created. That internal seed is not updated by set_seed, which is why you have to recreate the DataLoaders.
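To make this concrete, here is a minimal sketch reusing path, is_cat and dls from the example above (the internal rng is an implementation detail, so treat this as illustrative rather than guaranteed behaviour):

set_seed(42, True)                      # reseeds python/numpy/torch ...
first = dls.train.one_batch()           # ... but this dls still shuffles with the rng it drew at creation

set_seed(42, True)
dls = ImageDataLoaders.from_name_func(  # rebuilt, so its shuffling now derives from the fresh seed
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
second = dls.train.one_batch()          # this batch order is reproducible across reruns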

5 Likes

Just wanted to say that this is one of the most important threads on the forum … and hopefully it will find its way into the library and also a dedicated place in the docs with respect to when the seed needs to be set.

4 Likes

So what is the definition here of “reproducible”?

I’ve followed the above steps: running my “set_seed” function before creating my dataloaders via dblock.dataloaders(df, num_workers=0), setting the seed in my RandomSplitter, and running that “set_seed” function before creating my Learner and before every call to fit_one_cycle, but the results are never identical.

And that is what I actually expect when training a NN.

So am I wrong? Are folks expecting and getting the exact same validation loss and metrics after each epoch every time they re-run their data loading and training loop code?

Run #1: (screenshot of per-epoch losses/metrics)

Run #2: (screenshot of per-epoch losses/metrics)

No kernel restart … no fit_cbs … just rebuilding the Learner and calling fit_one_cycle.
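For concreteness, the sequence looks roughly like this (dblock, df, and the set_seed helper are as discussed earlier in the thread; the learner type here is just illustrative):

set_seed(42, True)                               # before building the DataLoaders
dls = dblock.dataloaders(df, num_workers=0)      # dblock's splitter is RandomSplitter(seed=42)

set_seed(42, True)                               # before building the Learner
learn = cnn_learner(dls, resnet34, metrics=error_rate)

set_seed(42, True)                               # and again before every call to fit_one_cycle
learn.fit_one_cycle(1)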

2 Likes

Running a tabular model, I was able to reproduce the exact same validation loss and metrics by using random_seed() and placing random_seed(0, use_cuda=False) before tabular_learner and also before fit_one_cycle.
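Roughly, the ordering that worked looks like this (random_seed is the helper posted earlier in the thread; dls and the metric are just placeholders from my setup):

random_seed(0, use_cuda=False)                   # seed right before the learner is created
learn = tabular_learner(dls, metrics=accuracy)
random_seed(0, use_cuda=False)                   # and again right before training
learn.fit_one_cycle(3)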

Many thanks to @Pomo, @rpcoelho, and others.

1 Like

Simple and clear, thanks!!!

1 Like