I tried to run the fastai library on JarvisLabs with 2x GPUs, using its distributed functionality, as follows.
# train.py
... # some setup code
learn = get_learner(get_dls(10))
learn.unfreeze()
with learn.distrib_ctx():
    learn.fit_one_cycle(1, slice(0.001, 0.01))
I launched it as:
accelerate launch train.py
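For reference, I believe the accelerate CLI also lets you pin the process count explicitly, which in my case should map one process per GPU:

accelerate launch --num_processes 2 train.py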
However, the script failed with an error. I don't have the stack trace at hand (I'll post it a bit later; I shut the instance down and can't rerun it right now), but it was something about socket connection issues.
I think the problem could be unrelated to the fastai library itself and is probably something with the instance configuration. Still, I decided to ask on the forum in case someone has encountered any accelerate/fastai issues on that platform.
Has anyone tried to run the distributed context there, or on some other cloud provider? And in general, from your point of view, what would be the best platform for training on 2-4 GPUs?
Actually, I tried to re-run my script, and this time I got a slightly different error:
RuntimeError: Expected to have finished reduction in the prior iteration before
starting a new one. This error indicates that your module has parameters that were
not used in producing loss. You can enable unused parameter detection by passing
the keyword argument `find_unused_parameters=True` to
`torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward`
function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't
able to locate the output tensors in the return value of your module's `forward` function.
Please include the loss function and the structure of the return value of `forward` of your
module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 501 502 503. In addition,
you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either
INFO or DETAIL to print out information about which particular parameters did not
receive gradient on this rank as part of this error.
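Following the error's own suggestion, re-launching with the debug variable set should print exactly which parameters missed gradients (assuming the variable propagates to the spawned workers):

TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch train.py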
The snippet I use works on a single GPU but fails when I launch it via accelerate; it's the one shown at the top of this thread (with some parts omitted). The same code works in Jupyter but returns the error above when running on two GPUs. I also tried to follow the steps that @VishnuSubramanian recommended: they work on fastai's example, but my snippet still fails. So it seems the problem is with my setup, and not with accelerate or the cloud platform. I'll try to debug it more carefully; a sketch of my plan is below.
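Here is a minimal sketch of that plan: on a single GPU, run one forward/backward pass by hand and list every trainable parameter that never receives a gradient, since those are exactly the ones DDP complains about. It assumes the learn object from my snippet; the iteration over named_parameters is plain PyTorch, nothing fastai-specific.

xb, yb = learn.dls.one_batch()    # grab a single batch from the DataLoaders
preds = learn.model(xb)           # manual forward pass
loss = learn.loss_func(preds, yb)
loss.backward()                   # manual backward pass
for i, (name, p) in enumerate(learn.model.named_parameters()):
    if p.requires_grad and p.grad is None:
        print(i, name)            # parameters that took no part in the loss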
from accelerate.utils import DistributedDataParallelKwargs

kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
with learn.distrib_ctx(kwarg_handelers=[kwargs]):
    ....
Really appreciate you posting this.
Minor correction: It’s spelled kwargs_handlers.
from accelerate.utils import DistributedDataParallelKwargs

kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
with learn.distrib_ctx(kwargs_handlers=[kwargs]):
    ....
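For completeness, here's how I'd slot the fix into the original script (a sketch: get_learner and get_dls are the poster's own helpers, omitted here):

# train.py
from accelerate.utils import DistributedDataParallelKwargs

... # some setup code
learn = get_learner(get_dls(10))
learn.unfreeze()

# allow DDP to tolerate parameters that receive no gradient
kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
with learn.distrib_ctx(kwargs_handlers=[kwargs]):
    learn.fit_one_cycle(1, slice(0.001, 0.01))

Worth noting that, as far as I know, find_unused_parameters=True makes DDP traverse the autograd graph on every iteration, so it adds some overhead; it's a workaround rather than a free setting.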