I came across this discussion on SaveModelCallback in single-host-multiple-GPU mode. I experimented a bit in a gist here to simulate the problem and suggested a fix.
Searching around the fastai code base, I wonder if the following two places may also be subject to the potential race condition in DDP mode:
1. In `RnnLearner.save_encoder()`, the `torch.save()` is not guarded the way `Learner.save()` is. The encoder file can be corrupted by simultaneous writes from multiple slave processes. Any subsequent call to `RnnLearner.load_encoder()` may need synchronization as well, otherwise it would run into the same problem as in the discussion mentioned above.
2. Similarly, `DataBunch.save()` calls `fastai.torch_core.try_save()`, which calls `torch.save()` unguarded. Extra care is needed if a script that saves the `DataBunch` is launched for distributed training as described here, especially when the processes share the same file system for saving/loading data. (A minimal sketch of the guard pattern follows this list.)
PyTorch’s “Getting Started with Distributed Data Parallel” tutorial explains the care needed in its “Save and Load Checkpoints” section.
`torch.distributed.barrier()` can help properly synchronize the read-after-write and write-after-read dependencies among processes, whether the file holds a model or data.
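As a sketch of that pattern (not fastai code; the model and path are placeholders), rank 0 performs the only write, a barrier keeps the other ranks from reading a half-written file, and a second barrier keeps rank 0 from touching the file again before everyone has finished reading:

```python
import torch
import torch.distributed as dist

def checkpoint_and_reload(model, path, rank):
    # Single writer: only rank 0 touches the file.
    if rank == 0:
        torch.save(model.state_dict(), path)
    # Read-after-write: every rank waits here until rank 0 has finished writing.
    dist.barrier()
    # map_location keeps each rank from loading the tensors onto GPU 0.
    state = torch.load(path, map_location={'cuda:0': f'cuda:{rank}'})
    model.load_state_dict(state)
    # Write-after-read: rank 0 must not overwrite or delete the file
    # before all ranks have finished loading it.
    dist.barrier()
```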
Should the two writers in #1 and #2 above be guarded as well?
Should the fastai guide to distributed training be updated to illustrate the issue and the care needed?
I’ll be happy to open issues and draft PRs for these.