Race conditions in SaveModelCallback when in distributed training mode

I came across this discussion about SaveModelCallback in single-host, multiple-GPU mode. I experimented a bit with a gist here to simulate the problem and suggested a fix.

Searching around the fastai code base, I wonder if the following two places may also be subject to the same potential race condition in DDP mode:

  1. In RNNLearner.save_encoder(), the torch.save() is not guarded the way Learner.save() is, so the encoder output can be corrupted by multiple simultaneous writes from the slave processes; a sketch of such a guard follows this list.
    Any subsequent call to RNNLearner.load_encoder() may need synchronization as well, otherwise it would run into the same problem as in the above-mentioned discussion.

  2. Similarly, DataBunch.save() calls fastai.torch_core.try_save(), which calls torch.save() unguarded. Extra care is needed if a script that saves the DataBunch is launched for distributed training as described here, especially when the processes share the same file system for saving/loading data.
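To make #1 concrete, here is roughly the kind of guard I mean, assuming fastai's rank_distrib() helper (zero on the master process, non-zero on the others); the function name and the way the encoder is pulled out are only illustrative, not fastai's actual code:

```python
import torch
from fastai.torch_core import rank_distrib

def save_encoder_guarded(learn, name: str):
    "Hypothetical guarded variant of RNNLearner.save_encoder()."
    if rank_distrib():  # non-zero rank: a slave process, so skip the write
        return
    encoder = learn.model[0]  # illustrative: the encoder half of the model
    torch.save(encoder.state_dict(), learn.path/learn.model_dir/f'{name}.pth')
```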

PyTorch’s “Getting Started with Distributed Data Parallel” tutorial explains the care needed in its “Save and Load Checkpoints” section. torch.distributed.barrier() can properly synchronize read-after-write and write-after-read among processes, whether the file holds a model or data.
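The pattern that tutorial describes is roughly the sketch below (the path and model are placeholders, and it assumes the process group has already been initialized):

```python
import torch
import torch.distributed as dist

CHECKPOINT = "model_checkpoint.pth"  # placeholder path on the shared file system

def checkpoint_and_reload(ddp_model):
    if dist.get_rank() == 0:
        # only one process writes, so concurrent writes cannot corrupt the file
        torch.save(ddp_model.state_dict(), CHECKPOINT)
    dist.barrier()  # no process starts reading until rank 0 has finished writing
    # map tensors saved from rank 0's GPU onto each process's own device
    map_location = {"cuda:0": f"cuda:{dist.get_rank()}"}
    ddp_model.load_state_dict(torch.load(CHECKPOINT, map_location=map_location))
```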

Should the two writers in #1 and #2 above be guarded as well?

Should the fastai guide to distributed training be updated to illustrate the issue and the care needed?

Any comments?

I’ll be happy to open issues and draft PRs for these.

Interesting. First things first, RNNLearner.save_encoder and DataBunch.save should indeed be guarded like Learner.save and Learner.export. I’ll do a sweep of all those save/export functions and add it.

Then for the barrier thing, I think it should be done at the base level in the load functions (not in the callback as you suggest, for instance) since it could impact anyone writing a distributed script.


@sgugger, greetings, and thanks for the quick response!

A syncing mechanism closer to load() indeed sounds better, as long as it doesn’t cause any deadlock. I can’t yet think of a scenario where it would.

A couple of implementation details around this barrier synchronization come to mind; you may have thought of them already:

  1. fastai.torch_core’s num_distrib() > 1 looks simpler and cleaner than the if torch.distributed.is_available() and torch.distributed.is_initialized(): check I had in the gist. :sweat_smile:

  2. Once the barrier is implemented inside load(), SaveModelCallback.on_train_end() won’t need the (self.learn.path/f'{self.learn.model_dir}/{self.name}.pth').is_file() check. That check may not be correct anyway: if the master process is lagging behind severely (say, by a few seconds), the condition will skip the load() entirely.
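Roughly the placement I have in mind, using that num_distrib() check (where exactly it would sit inside Learner.load(), and the checkpoint layout, are my assumptions):

```python
import torch
import torch.distributed as dist
from fastai.torch_core import num_distrib

def load_with_barrier(learn, name: str):
    "Hypothetical load: wait until the master's write has finished before reading."
    if num_distrib() > 1:
        # By the time rank 0 reaches this barrier its earlier torch.save() is done,
        # so every process is guaranteed to read a complete file afterwards.
        dist.barrier()
    state = torch.load(learn.path/learn.model_dir/f'{name}.pth', map_location='cpu')
    # handle both a plain state dict and a {'model': ..., 'opt': ...} checkpoint
    learn.model.load_state_dict(state['model'] if 'model' in state else state)
    return learn
```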

Cheers.

Yes, the right check is num_distrib() > 1. The same thing probably needs to be in RNNLearner.load_encoder, by the way. Would you mind making a PR with this?

:ok_hand: Sure, I’ll draft a PR and go from there. Thanks for the feedback!

Hello,

PR#2393 submitted. Thanks!

Would it be useful to port this PR to fastai_dev as well? :thinking:

Yeah, that would be good!

Hello Sylvain,

Looking at 13_learner.ipynb in the fastai_dev branch, the rank_distrib() check is currently in the Learner.save() method, which calls save_model(), which in turn calls torch.save().

save_model() (and load_model()) are both exported as well.

I wonder if the rank_distrib() check should be moved inside the exported save_model() function, closer to the ultimate torch.save() call. Any script that invokes save_model() directly would then be covered as well.

(Similarly, the barrier synchronization I’m trying to port over to this branch should probably go inside load_model(), just prior to the torch.load(), instead of inside Learner.load().)
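Roughly what I’m picturing (the signatures and the import path are my approximations, not the notebook’s exact code):

```python
import torch
import torch.distributed as dist
from local.torch_core import rank_distrib, num_distrib  # assumed import path

def save_model(file, model, opt):
    "Write the checkpoint from the master process only."
    if rank_distrib():  # non-zero rank: a worker process, so skip the write
        return
    state = {'model': model.state_dict(),
             'opt': opt.state_dict() if opt is not None else None}
    torch.save(state, file)

def load_model(file, model, opt, device=None):
    "Synchronize just before reading, so rank 0's write has completed."
    if num_distrib() > 1:
        dist.barrier()
    state = torch.load(file, map_location=device)
    model.load_state_dict(state['model'])
    if opt is not None and state.get('opt') is not None:
        opt.load_state_dict(state['opt'])
```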

What do you think? Thanks.

Phil

I’m not too sure of the usefulness of save_model and load_model since they are only used in Learner.save/Learner.load. For now, I’d keep the checks in the Learner methods.

Hello Sylvain,

I created PR291 to port this over to fastai_dev.

While creating the PR, I noticed a couple of workflow kinks related to script2notebook.py:

  1. dev/local/launch.py will break the script, because it doesn’t have the “DO NOT EDIT! File to edit: dev/17_callback_tracker.ipynb (unless otherwise specified)” line.

    Should launch.py be included in notebook/export.py's _manual_mods list?

  2. notebook/export.py cannot handle a new #Cell in a Python module; it expects the cell counts to remain the same. I had to manually add a dummy cell to the existing .ipynb, then regenerate it with script2notebook.py.

  3. notebook/export.py's script2notebook() converts the entire directory of Python scripts to .ipynb, regardless of whether they were changed. I hacked up export.py and script2notebook.py in my local sandbox to allow specifying a single file.

  4. The above hiccups with script2notebook.py make me wonder: is this script still considered the right way to update the notebook documentation for new library functions, or should I simply edit the .ipynb notebook directly?

Thanks.

Yes, since this file is a recent addition, I forgot to add it to the manual modules.

As for the other points, the development of any new functionality is expected to be done in a notebook, since it’s easier to test there and one can also add the proper documentation directly. Using script2notebook is only for when you do a quick fix in a module, like a typo correction.
