Searching in these and in the PyTorch forums, it seems that many others have run into this reproducibility issue. I gathered their suggestions into the following code:
import random
import numpy as np
import torch

def random_seed(seed_value, use_cuda):
    np.random.seed(seed_value)                     # numpy (cpu) RNG
    torch.manual_seed(seed_value)                  # torch cpu RNG
    random.seed(seed_value)                        # Python built-in RNG
    if use_cuda:
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)     # torch gpu RNGs (all devices)
        torch.backends.cudnn.deterministic = True  # needed for repeatable cuDNN kernels
        torch.backends.cudnn.benchmark = False
The good news is that the training code above now gives repeatable results. I did not test precisely which of these settings are critical, but I do know that torch.backends.cudnn.deterministic = True is necessary, and that num_workers does not matter. The not-so-good news is that this reproducibility does not survive a kernel restart.
The best news is that it also gives repeatable results across kernel restarts if and only if num_workers=0 is passed to the data loader. This has something to do with each worker getting initialized with its own random seeds. Someone more patient than I could devise a worker_init_fn that provides both kernel-restart repeatability and different seeds for each worker (a sketch of one possibility follows). But for now I am content with using num_workers=0.
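For anyone who wants to try, something along these lines might work. This is an untested sketch: the base seed of 42 and the simple seed_value + worker_id scheme are just illustrative choices, not something I have verified across restarts.

def worker_init_fn(worker_id):
    # Derive a deterministic but per-worker seed from a fixed base value,
    # so results should repeat across kernel restarts while each worker
    # still gets a different seed.
    worker_seed = 42 + worker_id        # 42 is an arbitrary base seed
    random.seed(worker_seed)
    np.random.seed(worker_seed)
    torch.manual_seed(worker_seed)

# passed to the loader, e.g.:
# DataLoader(dataset, num_workers=4, worker_init_fn=worker_init_fn)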
To sum up: to get reproducible measures across runs and kernel restarts, use the above random_seed function and pass num_workers=0 when generating the DataBunch. Non-repeatability was leaking in through cuDNN and the data loader workers.
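For example, a typical call sequence looks roughly like this (assuming fastai v1's ImageDataBunch.from_folder, which forwards num_workers to its DataLoaders; the dataset path, architecture, and seed value are placeholders):

from fastai.vision import *

random_seed(42, use_cuda=True)   # seed everything before touching the data
data = ImageDataBunch.from_folder('data/my_images', valid_pct=0.2,
                                  num_workers=0)   # single-process loading
learn = cnn_learner(data, models.resnet34, metrics=accuracy)
learn.fit_one_cycle(1)           # repeatable across runs and kernel restarts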