Does training a model in fastai v2 with torch.cuda.set_device(1) force me to predict on cuda:1?

jlopez · June 18, 2021, 10:27am

Hi,
I trained a simple model on a multi-GPU server and, for resource management, used torch.cuda.set_device(1) during training.
Then I exported the model and loaded it on another server (the inference server), where I realised that I couldn’t load the model on cuda:0. So I wrote my own load_learner that uses the map_location argument:

import pickle
import torch
from fastai.torch_core import distrib_barrier

def my_load_learner(fname, cpu=True, pickle_module=pickle, map_location='cuda:0'):
    "Load a `Learner` object in `fname`, optionally putting it on the `cpu`"
    distrib_barrier()
    # map_location only remaps the pickled tensors; it does not change res.dls.device
    res = torch.load(fname, map_location='cpu' if cpu else map_location, pickle_module=pickle_module)
    if hasattr(res, 'to_fp32'): res = res.to_fp32()
    if cpu: res.dls.cpu()
    return res

This isn’t working when I try to predict with the loaded model. I have an error:

[…] File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 995, in __torch_function__
    ret = func(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

Why can’t I train on cuda:1 and then load the model on cuda:0 for prediction? Is there something I’m missing?

Thanks!

muellerzr · June 18, 2021, 10:29am

You need to also set the device in the DataLoaders. So res.dls.to("cuda:1")
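
A minimal sketch of that fix, under some assumptions: move_learner_to is a hypothetical helper (not part of fastai), and the target device should presumably be whichever one the model was mapped to at load time (cuda:0 in the question above). It relies only on the dls.to call mentioned in this reply and on the standard nn.Module.to method.

```python
def move_learner_to(learn, device="cuda:0"):
    """Put the model and the DataLoaders of `learn` on the same device.

    Hypothetical helper for illustration; `device` defaults to the
    inference device used in the question.
    """
    # nn.Module.to moves the weights in place and returns the module.
    learn.model.to(device)
    # DataLoaders.to changes the device that input batches are sent to.
    learn.dls.to(device)
    return learn
```

Usage would look something like learn = move_learner_to(my_load_learner(fname, cpu=False), "cuda:0"), so that the weights and the batches end up on the same GPU before calling predict.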

jlopez · June 18, 2021, 10:40am

Thanks @muellerzr, that did the trick. Now that I know what to look for, I can see it was right there in the error trace.