Does anyone know how to troubleshoot a RuntimeError in multi-GPU training when passing a weights tensor to initialize `CrossEntropyLossFlat` or `BCEWithLogitsLossFlat`?
Exception
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA_nll_loss_forward)
```
Code
```python
from fastai.vision.all import *
from fastai.text.all import *
from fastai.tabular.all import *
from fastai.collab import *
from accelerate import notebook_launcher
from fastai.distributed import *

path = untar_data(URLs.PETS)/'images'

def train():
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2,
        label_func=lambda x: x[0].isupper(), item_tfms=Resize(224))
    wgt = tensor([1., 2.])  # <---- This breaks multi-gpu training
    learn = vision_learner(dls, resnet34, metrics=error_rate,
                           loss_func=CrossEntropyLossFlat(weight=wgt)).to_fp16()
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fine_tune(1)

notebook_launcher(train, num_processes=4)
```
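
To see where each process thinks its tensors live, here is a small probe I'm using (it assumes `distrib_ctx` builds an `accelerate.Accelerator` under the hood, which is my reading of `fastai.distributed`; the probe is debugging code, not part of training):

```python
import torch
from accelerate import Accelerator, notebook_launcher

def probe():
    # Each launched process creates its own Accelerator; as far as I can tell,
    # this is the step that pins the process to its own GPU.
    acc = Accelerator()
    # A freshly created weight tensor defaults to CPU; nothing here moves it.
    wgt = torch.tensor([1., 2.])
    print(f"rank {acc.process_index}: accelerator device = {acc.device}, "
          f"torch current device = cuda:{torch.cuda.current_device()}, "
          f"fresh weight on {wgt.device}")

notebook_launcher(probe, num_processes=4)
```

If that's accurate, a weight tensor created (or moved to GPU) before the per-process device is set would end up on cuda:0 in every process, while each replica's inputs live on that rank's own GPU, which would match the cuda:1 vs cuda:0 mismatch above.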
Observations:
- Setting `wgt = None` results in successful training.
- Setting `wgt = tensor([1.,2.]).to(1)` results in the same error, but strangely it now complains about `...found at least two devices, cuda:2 and cuda:0!`, which is not even `wgt`'s assigned device!
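
My best guess at a workaround (untested sketch; it assumes that once you are inside `distrib_ctx` the current CUDA device has already been set per process, so moving the weight to the current device lands it on the right GPU):

```python
import torch

def train():
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2,
        label_func=lambda x: x[0].isupper(), item_tfms=Resize(224))
    learn = vision_learner(dls, resnet34, metrics=error_rate).to_fp16()
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        # Build the weight tensor *inside* the context, after this process has
        # claimed its GPU, so each rank's loss weight sits on that rank's device.
        wgt = tensor([1., 2.]).to(torch.cuda.current_device())
        learn.loss_func = CrossEntropyLossFlat(weight=wgt)
        learn.fine_tune(1)

notebook_launcher(train, num_processes=4)
```

Can anyone confirm whether this is the right approach, or whether fastai has a supported way to pass per-class loss weights under multi-GPU training?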