RuntimeError: Only CUDA dense tensor is supported for NCCL collective operations

Hi, I am new to fastai. I am trying to train a UNet, as described in lesson 3 of part 1, on my custom dataset. Training runs and produces results on a single GPU. However, when I switch to distributed training, I get the following error:

    File "UNet_benchmark.py", line 98, in <module>
      learn.fit_one_cycle(2, slice(lr), pct_start=0.9)
    File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/train.py", line 23, in fit_one_cycle
      learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
    File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/basic_train.py", line 200, in fit
      fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/basic_train.py", line 106, in fit
      cb_handler=cb_handler, pbar=pbar)
    File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/basic_train.py", line 63, in validate
      if cb_handler and cb_handler.on_batch_end(val_losses[-1]): break
    File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/callback.py", line 308, in on_batch_end
      self('batch_end', call_mets = not self.state_dict['train'])
    File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/callback.py", line 250, in __call__
      for met in self.metrics: self._call_and_update(met, cb_name, **kwargs)
    File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/callback.py", line 241, in _call_and_update
      new = ifnone(getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs), dict())
    File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/callback.py", line 347, in on_batch_end
      dist.all_reduce(val, op=dist.ReduceOp.SUM)
    File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 831, in all_reduce
      work = _default_pg.allreduce([tensor], opts)
    RuntimeError: Only CUDA dense tensor is supported for NCCL collective operations

Here is the metric function I am using:

    def iou(y_pred, y_true):
        code = name2id['sidewalk'] + 1
        y_true = y_true.squeeze(1)
        y_true = y_true == code
        y_pred = y_pred.argmax(dim=1)
        y_pred = y_pred == code
        intersection = y_true & y_pred
        union = y_true | y_pred
        iou_score = (torch.sum(intersection).item() + smooth) / (torch.sum(union).item() + smooth)
        return tensor(iou_score)
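
Judging from the traceback, the failing `dist.all_reduce(val, ...)` call is made on a metric value, and NCCL only accepts CUDA tensors. The metric above calls `.item()` and wraps the result in `tensor(...)`, which most likely yields a CPU tensor, so this may be the cause. A minimal sketch of a variant that keeps the result on the same device as the inputs is shown here. The name `iou_on_device`, the explicit `code` argument, and the default value of `smooth` are assumptions for illustration, not part of the original script.

```python
import torch

SMOOTH = 1e-6  # assumed value; the original `smooth` is defined elsewhere


def iou_on_device(y_pred, y_true, code, smooth=SMOOTH):
    """IoU for a single class, returned as a 0-dim tensor on the inputs' device.

    y_pred: (N, C, H, W) raw scores; y_true: (N, 1, H, W) integer labels.
    """
    true_mask = y_true.squeeze(1) == code     # (N, H, W) mask of the target class
    pred_mask = y_pred.argmax(dim=1) == code  # (N, H, W) mask of predicted class
    intersection = (true_mask & pred_mask).sum().float()
    union = (true_mask | pred_mask).sum().float()
    # No .item() and no tensor(...): the result never leaves the device.
    return (intersection + smooth) / (union + smooth)
```

To pass it as a metric, `code` could be bound with `functools.partial`, for example `partial(iou_on_device, code=name2id['sidewalk'] + 1)`. I have not verified this against a multi-GPU run.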

I am using this reference to set up distributed training:

    learn = unet_learner(data, models.resnet34, metrics=metrics, wd=wd)
    learn.callback_fns.append(partial(LearnerTensorboardWriter,
                                      base_dir=tboard_path,
                                      name='exp frozen encoder'))
    learn = learn.to_distributed(args.local_rank)