Distributed training fails

I’m having issues with distributed training in fastai2.
I have the following code at the top of my training script:

import argparse
import torch

# Each process is launched with its own GPU id (e.g. --gpu 0 ... --gpu 3).
parser = argparse.ArgumentParser()
parser.add_argument("--gpu", type=int)
args = parser.parse_args()

# Pin this process to its GPU and join the NCCL process group via env:// init.
torch.cuda.set_device(args.gpu)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
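
With init_method='env://', each of these processes expects RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT in its environment, which a launcher normally sets while starting one copy of the script per GPU. A minimal hand-rolled spawner might look like this (a sketch only; the GPU count, port and script name are assumptions):

import os, subprocess, sys

n_gpus = 4                                     # assumption: GPUs on this machine
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')  # any free port
os.environ['WORLD_SIZE'] = str(n_gpus)

procs = []
for i in range(n_gpus):
    env = dict(os.environ, RANK=str(i), LOCAL_RANK=str(i))
    # each worker gets its own --gpu id, matching torch.cuda.set_device(args.gpu) above
    procs.append(subprocess.Popen([sys.executable, 'train.py', '--gpu', str(i)], env=env))
for p in procs:
    p.wait()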

Then I call to_distributed on the learner just before fitting:

learn_gen = learn_gen.to_distributed(args.gpu)

First I get this warning:

UserWarning: You are setting an attribute (dl) that also exists in the learner. Please be advised that you're not setting it in the learner but in the callback. Use `self.learn.dl` if you would like to change it in the learner.

and then this error (one per gpu):

Traceback (most recent call last):
  File "train.py", line 363, in <module>
    learn_gen.fit_one_cycle(1, lr_max=1e-3, pct_start=0.4, wd=1e-3)
  File "/root/fastai2/fastai2/callback/schedule.py", line 90, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "/root/fastai2/fastai2/learner.py", line 294, in fit
    self._do_epoch_train()
  File "/root/fastai2/fastai2/learner.py", line 269, in _do_epoch_train
    self.all_batches()
  File "/root/fastai2/fastai2/learner.py", line 247, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "/root/fastai2/fastai2/data/load.py", line 99, in __iter__
    yield self.after_batch(b)
  File "/root/fastcore/fastcore/transform.py", line 188, in __call__
    def __call__(self, o): return compose_tfms(o, tfms=self.fs, split_idx=self.split_idx)
  File "/root/fastcore/fastcore/transform.py", line 136, in compose_tfms
    x = f(x, **kwargs)
  File "/root/fastcore/fastcore/transform.py", line 71, in __call__
    def __call__(self, x, **kwargs): return self._call('encodes', x, **kwargs)
  File "/root/fastcore/fastcore/transform.py", line 83, in _call
    res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
  File "/root/fastcore/fastcore/transform.py", line 83, in <genexpr>
    res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
  File "/root/fastcore/fastcore/transform.py", line 87, in _do_call
    return x if f is None else retain_type(f(x, **kwargs), x, f.returns_none(x))
  File "/root/fastcore/fastcore/dispatch.py", line 98, in __call__
    return f(*args, **kwargs)
  File "/root/fastai2/fastai2/data/transforms.py", line 293, in encodes
    def encodes(self, x:TensorImage): return (x-self.mean) / self.std
  File "/root/fastai2/fastai2/torch_core.py", line 272, in _f
    res = getattr(super(TensorBase, self), fn)(*args, **kwargs)
RuntimeError: expected device cpu but got device cuda:1

I think the warning might point to a bug in the DistributedTrainer callback, but I don’t know the code well enough to be sure.
After fixing the warning like this:

from fastcore.all import patch
from fastai2.distributed import DistributedTrainer

# Set the dl on the learner (self.learn.dl) instead of shadowing it on the callback.
@patch
def begin_train(self:DistributedTrainer):    self.learn.dl = self._wrap_dl(self.learn.dl)
@patch
def begin_validate(self:DistributedTrainer): self.learn.dl = self._wrap_dl(self.learn.dl)

the warning goes away but the error persists.

Adding some print statements about the device in the encodes of Normalize revealed that the input is initially on the correct device, but then it switches and Normalize starts receiving inputs on the CPU:

Could not do one pass in your dataloader, there is something wrong in it
x.device: cuda:1, mean.device: cuda:1, std.device:cuda:1
Could not do one pass in your dataloader, there is something wrong in it
x.device: cuda:3, mean.device: cuda:3, std.device:cuda:3
Could not do one pass in your dataloader, there is something wrong in it
x.device: cuda:0, mean.device: cuda:0, std.device:cuda:0
Could not do one pass in your dataloader, there is something wrong in it
epoch     train_loss  valid_loss  pixel     feat_0    feat_1    feat_2    time    
x.device: cpu, mean.device: cuda:1, std.device:cuda:1
x.device: cpu, mean.device: cuda:3, std.device:cuda:3
x.device: cpu, mean.device: cuda:0, std.device:cuda:0
x.device: cpu, mean.device: cuda:2, std.device:cuda:2
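
For reference, the same class of error can be reproduced outside fastai by mixing devices in the normalization arithmetic. A minimal sketch (assuming a CUDA machine; the stats are just placeholder ImageNet values):

import torch

mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1).cuda()
std  = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1).cuda()
x = torch.rand(8, 3, 64, 64)        # batch left on the CPU
try:
    (x - mean) / std                # cpu input, cuda stats
except RuntimeError as e:
    print(e)                        # e.g. "expected device cpu but got device cuda:0"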

Any ideas what’s going on here?

A bug was mentioned in the issues, but we are on a tight deadline for the book, so we won’t have time to investigate until it has passed.

I figured out the issue; the PR is open here.

After the Cuda transform was removed in favor of a device parameter on the DataLoader, the DistributedDL.from_dl method was not updated to pass the device from the wrapped dl to the distributed dataloader, so batches come off the distributed dataloader on the CPU while the Normalize stats are already on the GPU.
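
The idea behind the fix, in sketch form (not the exact PR diff; it assumes the from_dl(dl, rank, world_size) classmethod shown later in this thread and a writable device attribute on the fastai2 DataLoader):

from fastai2.distributed import DistributedDL

_orig_from_dl = DistributedDL.from_dl   # keep a handle on the original classmethod

@classmethod
def _from_dl_with_device(cls, dl, rank, world_size):
    res = _orig_from_dl(dl, rank, world_size)
    res.device = dl.device              # propagate the device the Cuda transform used to handle
    return res

DistributedDL.from_dl = _from_dl_with_device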

I also fixed this warning by referencing self.learn.dl as suggested:

/home/jaidmin/Software/Devel/fastai2/fastai2/learner.py:30: UserWarning: You are setting an attribute (dl) that also exists in the learner. Please be advised that you're not setting it in the learner but in the callback. Use `self.learn.dl` if you would like to change it in the learner.

Thanks for the fixes!

Hi @j.laute

I am currently trying to use distributed training with fastai2.
Can you share your minimal working code? I am not able to get it working and find it a bit confusing due to the lack of documentation. P.S.: I am a newbie to distributed training, so it would be great if you could explain a bit about the rank and world_size parameters.
Currently my pipeline looks like this:

  1. Use Datablock API
  2. Create DLs from the Datablock instance
  3. Create a Learner using the dls
  4. DistributedDL.from_dl(dls, rank=1, world_size=8) --> for an 8-GPU AWS p2.8xlarge instance
  5. Create a distributed learner using learn.to_distributed(cuda_id)

Thanks in advance!

@sgugger Can you please help me with a minimal example? I am not able to make it work with the existing documentation. P.S.: Sorry for the trouble, this is my first time training a model in a distributed fashion.
Thanks a lot.

Look at the train_imagenette example.
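
For orientation, here is a hedged sketch of what a bare-bones distributed script along the lines of this thread could look like. On rank and world_size: rank is the index of the current process and world_size is the total number of processes; both come from the launcher/process group, so you normally do not call DistributedDL.from_dl by hand, to_distributed takes care of the wrapping. The dataset, architecture and hyperparameters below are illustrative assumptions; the maintained reference is still the train_imagenette script:

import argparse
import torch
from fastai2.vision.all import *
from fastai2.distributed import *

# Same per-process setup as in the first post of this thread.
parser = argparse.ArgumentParser()
parser.add_argument("--gpu", type=int)
args = parser.parse_args()
torch.cuda.set_device(args.gpu)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

# Ordinary single-GPU-style data and learner setup...
path = untar_data(URLs.IMAGENETTE_160)
dls = ImageDataLoaders.from_folder(path, valid='val', item_tfms=Resize(160),
                                   batch_tfms=Normalize.from_stats(*imagenet_stats), bs=64)
learn = cnn_learner(dls, resnet34, metrics=accuracy)

# ...then to_distributed wraps the model in DistributedDataParallel and takes
# care of the distributed dataloaders for this process.
learn = learn.to_distributed(args.gpu)
learn.fit_one_cycle(1, 1e-3)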