I’m having issues with distributed training in fastai2.
I have the following code at the top of my training script:
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--gpu", type=int)
args = parser.parse_args()
torch.cuda.set_device(args.gpu)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
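For reference, `init_method='env://'` makes `init_process_group` pick up the rendezvous settings from environment variables, which a per-GPU launcher (e.g. `python -m fastai2.launch train.py`, assuming that's how the script is started) is expected to set for each worker. A minimal sketch of those variables, with placeholder values only:

import os
# Placeholders for illustration; a real launcher sets these per process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # address of the rank-0 process
os.environ.setdefault("MASTER_PORT", "29500")      # free TCP port on that host
os.environ.setdefault("WORLD_SIZE", "4")           # total number of processes (one per GPU)
os.environ.setdefault("RANK", "0")                 # this process's rank, 0..WORLD_SIZE-1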
Then I call `to_distributed` on the learner just before fitting:
learn_gen = learn_gen.to_distributed(args.gpu)
First I get this warning:
…callback. Use `self.learn.dl` if you would like to change it in the learner.
and then this error (one per GPU):
Traceback (most recent call last):
File "train.py", line 363, in <module>
learn_gen.fit_one_cycle(1, lr_max=1e-3, pct_start=0.4, wd=1e-3)
File "/root/fastai2/fastai2/callback/schedule.py", line 90, in fit_one_cycle
self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
File "/root/fastai2/fastai2/learner.py", line 294, in fit
self._do_epoch_train()
File "/root/fastai2/fastai2/learner.py", line 269, in _do_epoch_train
self.all_batches()
File "/root/fastai2/fastai2/learner.py", line 247, in all_batches
for o in enumerate(self.dl): self.one_batch(*o)
File "/root/fastai2/fastai2/data/load.py", line 99, in __iter__
yield self.after_batch(b)
File "/root/fastcore/fastcore/transform.py", line 188, in __call__
def __call__(self, o): return compose_tfms(o, tfms=self.fs, split_idx=self.split_idx)
File "/root/fastcore/fastcore/transform.py", line 136, in compose_tfms
x = f(x, **kwargs)
File "/root/fastcore/fastcore/transform.py", line 71, in __call__
def __call__(self, x, **kwargs): return self._call('encodes', x, **kwargs)
File "/root/fastcore/fastcore/transform.py", line 83, in _call
res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
File "/root/fastcore/fastcore/transform.py", line 83, in <genexpr>
res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
File "/root/fastcore/fastcore/transform.py", line 87, in _do_call
return x if f is None else retain_type(f(x, **kwargs), x, f.returns_none(x))
File "/root/fastcore/fastcore/dispatch.py", line 98, in __call__
return f(*args, **kwargs)
File "/root/fastai2/fastai2/data/transforms.py", line 293, in encodes
def encodes(self, x:TensorImage): return (x-self.mean) / self.std
File "/root/fastai2/fastai2/torch_core.py", line 272, in _f
res = getattr(super(TensorBase, self), fn)(*args, **kwargs)
RuntimeError: expected device cpu but got device cuda:1
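The last frame makes the failure concrete: `Normalize.encodes` computes `(x - self.mean) / self.std`, so if the batch arrives on the CPU while the stats live on the GPU, the subtraction raises exactly this device-mismatch error. A stand-alone sketch of that mismatch (needs a CUDA device; not taken from the training script):

import torch

x = torch.zeros(2, 3, 8, 8)                                    # batch left on the CPU
mean = torch.tensor([0.5, 0.5, 0.5]).view(1, 3, 1, 1).cuda()   # stats on the GPU, as Normalize holds them
std = torch.tensor([0.3, 0.3, 0.3]).view(1, 3, 1, 1).cuda()
try:
    _ = (x - mean) / std
except RuntimeError as e:
    print(e)  # e.g. "expected device cpu but got device cuda:0"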
I think the warning might point to a bug in DistributedTrainer, but I don’t know the code well enough to be sure.
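For context, the warning presumably comes from fastai2’s `Callback.__setattr__`: assigning to a name that also exists on the learner sets it on the callback object and emits that warning, so the stock `begin_train`/`begin_validate` (which apparently assign to `self.dl`) trigger it. A tiny illustration, using a made-up `Demo` callback that is not part of fastai2:

from fastai2.learner import Callback  # in later revisions Callback lives in fastai2.callback.core

class Demo(Callback):
    def begin_train(self):
        self.dl = self.learn.dl        # sets `dl` on the callback and triggers the warning
        self.learn.dl = self.learn.dl  # explicitly targets the learner, so no warning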
After fixing the warning like this:
from fastcore.foundation import patch  # in newer fastcore: from fastcore.basics import patch
from fastai2.distributed import DistributedTrainer

@patch
def begin_train(self:DistributedTrainer): self.learn.dl = self._wrap_dl(self.learn.dl)
@patch
def begin_validate(self:DistributedTrainer): self.learn.dl = self._wrap_dl(self.learn.dl)
the warning goes away but the error persists.
Adding some print statements for the devices in `Normalize.encodes` revealed that the input is initially on the correct device, but once training actually starts it arrives on “cpu” while the stats stay on the GPU:
Could not do one pass in your dataloader, there is something wrong in it
x.device: cuda:1, mean.device: cuda:1, std.device:cuda:1
Could not do one pass in your dataloader, there is something wrong in it
x.device: cuda:3, mean.device: cuda:3, std.device:cuda:3
Could not do one pass in your dataloader, there is something wrong in it
x.device: cuda:0, mean.device: cuda:0, std.device:cuda:0
Could not do one pass in your dataloader, there is something wrong in it
epoch train_loss valid_loss pixel feat_0 feat_1 feat_2 time
x.device: cpu, mean.device: cuda:1, std.device:cuda:1
x.device: cpu, mean.device: cuda:3, std.device:cuda:3
x.device: cpu, mean.device: cuda:0, std.device:cuda:0
x.device: cpu, mean.device: cuda:2, std.device:cuda:2
Any ideas what’s going on here?