Searching around the fastai code base, I wonder whether the following two places may also be subject to the potential race condition in DDP mode:
In RNNLearner.save_encoder(), the torch.save() call is not guarded the way Learner.save() is, so the encoder file can be corrupted by simultaneous writes from multiple slave processes.
Any subsequent call to RNNLearner.load_encoder() may need synchronization as well, otherwise it would run into the same problem as in the above-mentioned discussion. (A rough sketch of both guards follows below.)
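To make the concern concrete, the kind of guard I have in mind mirrors the one Learner.save() already uses. This is a hand-written sketch, not the library code: it assumes fastai v1's num_distrib()/rank_distrib() helpers from fastai.torch_core, and the encoder path and method bodies are simplified.

```python
import torch
import torch.distributed as dist
from fastai.torch_core import num_distrib, rank_distrib

def save_encoder_guarded(learn, name):
    "Sketch: only the master process (rank 0) writes, so slave writes can't corrupt the file."
    if num_distrib() > 1 and rank_distrib() != 0: return
    torch.save(learn.model[0].state_dict(), learn.path/learn.model_dir/f'{name}.pth')

def load_encoder_guarded(learn, name):
    "Sketch: wait until the master's write has finished before any process reads."
    if num_distrib() > 1: dist.barrier()
    learn.model[0].load_state_dict(torch.load(learn.path/learn.model_dir/f'{name}.pth'))
```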
Interesting. First things first, RNNLearner.save_encoder and DataBunch.save should indeed be guarded like Learner.save and Learner.export. I’ll do a sweep of all those save/export functions and add the guard.
Then for the barrier thing, I think it should be done at the base level in the load functions (not in the callback like you suggest, for instance) since it could impact anyone writing a distributed script.
@sgugger, greetings, and thanks for the quick response!
A sync’ing mechanism closer to the load() indeed sounds better, as long as it wouldn’t cause any deadlock. I can’t yet think of any scenario where it would.
A couple of implementation details around this barrier synchronization come to mind, and you may have thought of them already:
fastai.torch_core’s num_distrib() > 1 looks simpler and cleaner than the if torch.distributed.is_available() and torch.distributed.is_initialized(): check I had in the gist.
Once the barrier is implemented inside load(), SaveModelCallback.on_train_end() won’t need the check of (self.learn.path/f'{self.learn.model_dir}/{self.name}.pth').is_file(). That check may not be correct anyway: if the master process is lagging behind severely (say by a few seconds), the file won’t exist yet and this condition will make the slave processes skip the load() entirely.
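In other words, something along these lines inside load(). This is a minimal sketch with an abbreviated signature and body, assuming the num_distrib() helper from fastai.torch_core; it is not the actual Learner.load() implementation.

```python
import torch
import torch.distributed as dist
from fastai.torch_core import num_distrib

def load(self, name, device=None):
    "Sketch of Learner.load() with the barrier added."
    if num_distrib() > 1:
        # All processes block here. The master only reaches load() after its
        # save() has returned, so the checkpoint is complete when we proceed.
        dist.barrier()
    state = torch.load(self.path/self.model_dir/f'{name}.pth', map_location=device)
    self.model.load_state_dict(state)
    return self
```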
Yes, the right check is num_distrib() > 1. The same thing probably needs to be in RNNLearner.load_encoder, by the way. Would you mind making a PR with this?
Looking at 13_learner.ipynb in the fastai_dev branch, the rank_distrib() check currently sits in the Learner.save() method, which calls save_model(), which in turn calls torch.save().
save_model() (and load_model()) are both exported as well.
I wonder if the rank_distrib() check should be moved inside the exported function save_model(), closer to the ultimate torch.save() call? Any script that invokes save_model() directly would then be covered as well.
(Similarly, the barrier synchronization I’m trying to port over to this branch should probably go inside load_model(), just prior to the torch.load() call, instead of inside Learner.load(). A sketch of both is below.)
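Roughly what I’m picturing for the two exported helpers. This is an illustrative sketch only: the real save_model/load_model in 13_learner.ipynb carry more arguments, the with_opt handling is simplified, and the fastai.torch_core import stands in for wherever num_distrib()/rank_distrib() live in the dev/local layout.

```python
import torch
import torch.distributed as dist
from fastai.torch_core import num_distrib, rank_distrib

def save_model(file, model, opt, with_opt=True):
    "Sketch: the rank check sits next to torch.save(), so direct callers are covered too."
    if num_distrib() > 1 and rank_distrib() != 0: return
    state = model.state_dict()
    if with_opt: state = {'model': state, 'opt': opt.state_dict()}
    torch.save(state, file)

def load_model(file, model, opt, with_opt=True, device=None):
    "Sketch: barrier right before torch.load(), so the master's write has completed."
    if num_distrib() > 1: dist.barrier()
    state = torch.load(file, map_location=device)
    model.load_state_dict(state['model'] if with_opt else state)
    if with_opt and opt is not None: opt.load_state_dict(state['opt'])
```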
I’m not too sure of the usefulness of save_model and load_model since they are only used in Learner.save/Learner.load. For now, I’d keep the checks in the Learner methods.
While creating the PR, I noticed a couple of workflow kinks related to script2notebook.py:
dev/local/launch.py will break the script, because it doesn’t have the “DO NOT EDIT! File to edit: dev/17_callback_tracker.ipynb (unless otherwise specified)” line.
Should launch.py be added to the list of notebook/export.py's _manual_mods?
notebook/export.py cannot handle a new #Cell in a Python library file; it expects the cell counts to remain the same. I had to manually add a dummy cell to the existing ipynb, then regenerate it using script2notebook.py.
notebook/export.py: script2notebook() converts the entire directory of Python scripts to ipynb, regardless of whether they were changed or not. I hacked up export.py and script2notebook.py in my local sandbox to allow specifying a single file.
The above hiccups with script2notebook.py make me wonder: is this script still considered the right way to update the notebook documentation for new library functions? Or should I simply edit the ipynb notebook directly?
Yes, that file is a recent addition, so I forgot to add it to the manual modules.
As for the other points, the development of any new functionality is expected to be done in a notebook, since it’s easier to test there and one can also add the proper documentation directly. Using script2notebook is only for when you do a quick fix in a module, like a typo correction.