Searching in these and in the PyTorch forums, it seems that many others have run into this reproducibility issue. I gathered their suggestions into the following code:
import random
import numpy as np
import torch

def random_seed(seed_value, use_cuda):
    np.random.seed(seed_value)                     # numpy (cpu) RNG
    torch.manual_seed(seed_value)                  # torch cpu RNG
    random.seed(seed_value)                        # Python built-in RNG
    if use_cuda:
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)     # torch gpu RNGs (all devices)
        torch.backends.cudnn.deterministic = True  # needed for repeatable cuDNN kernels
        torch.backends.cudnn.benchmark = False
The good news is that the training code above now gives repeatable results. I did not test precisely which of these settings are critical, but I do know that torch.backends.cudnn.deterministic = True is necessary, and that num_workers does not matter. The not-so-good news is that this reproducibility does not survive a kernel restart.
The best news is that it also gives repeatable results across kernel restarts if and only if num_workers=0 is passed to the data loader. This has something to do with each worker getting initialized with its own random seeds. Someone more patient than I could devise a worker_init_fn that provides both kernel-restart repeatability and different seeds for each worker (a sketch of one possibility follows). But for now I am content with using num_workers=0.
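For anyone who wants to try, something along these lines might work. This is an untested sketch: the base seed of 42 and the simple seed_value + worker_id scheme are just illustrative choices, not something I have verified across restarts.

def worker_init_fn(worker_id):
    # Derive a deterministic but per-worker seed from a fixed base value,
    # so results should repeat across kernel restarts while each worker
    # still gets a different seed.
    worker_seed = 42 + worker_id        # 42 is an arbitrary base seed
    random.seed(worker_seed)
    np.random.seed(worker_seed)
    torch.manual_seed(worker_seed)

# passed to the loader, e.g.:
# DataLoader(dataset, num_workers=4, worker_init_fn=worker_init_fn)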
To sum up: to get reproducible measures across runs and kernel restarts, use the above random_seed function and pass num_workers=0 when generating the DataBunch. Non-repeatability was leaking in through cuDNN and the data loader workers.
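For example, a typical call sequence looks roughly like this (assuming fastai v1's ImageDataBunch.from_folder, which forwards num_workers to its DataLoaders; the dataset path, architecture, and seed value are placeholders):

from fastai.vision import *

random_seed(42, use_cuda=True)   # seed everything before touching the data
data = ImageDataBunch.from_folder('data/my_images', valid_pct=0.2,
                                  num_workers=0)   # single-process loading
learn = cnn_learner(data, models.resnet34, metrics=accuracy)
learn.fit_one_cycle(1)           # repeatable across runs and kernel restarts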