Consequences of Making A PyTorch Model Run In Parallel?

Hello,

I am curious what the repercussions could be if I set a model to run on multiple GPUs with PyTorch, something like below. Does nn.DataParallel do all the magic behind the scenes for me even if the model was not necessarily set up to be trained in parallel? What types of drawbacks could this cause?

learn.model = nn.DataParallel(learn.model)
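
For reference, the fuller pattern I have seen suggested looks roughly like this (the nn.Linear model is just a stand-in for my actual model):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the real model
if torch.cuda.device_count() > 1:
    # replicate the module across all visible GPUs;
    # input batches are split along dim 0 at forward time
    model = nn.DataParallel(model)
model = model.cuda()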

So far in my training, accuracy seems to be about the same as it was when I was training on one GPU. I am also curious about the basic concepts involved in modifying a PyTorch model to run on multiple GPUs. I read the docs (https://pytorch.org/docs/stable/nn.html#dataparallel-layers-multi-gpu-distributed) and saw this note: "The parallelized module must have its parameters and buffers on device_ids[0] before running this DataParallel module." To be honest, I don't really know what that means, so I was curious if anyone here could help explain the context I may be missing.
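
From what I can tell (please correct me if I'm misreading the note), it just means the model has to be moved to the first GPU in device_ids before wrapping. A minimal sketch of my understanding, with nn.Linear standing in for the real model:

import torch
import torch.nn as nn

device = torch.device('cuda:0')      # this is device_ids[0]
model = nn.Linear(10, 2).to(device)  # parameters and buffers now live on cuda:0
model = nn.DataParallel(model, device_ids=[0, 1])

x = torch.randn(64, 10).to(device)   # the batch is split across the listed GPUs
y = model(x)                         # outputs are gathered back on cuda:0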

Also, I am using the fastai v1 fastai.text.learner.language_model_learner function to create the language model, so the model I am wrapping is the learn.model returned by that function. I didn't see anything specific to parallelism in the code for language_model_learner, so I wasn't sure whether just setting DataParallel on the model could break something.
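
For completeness, my setup is roughly the following (data_lm is a placeholder for my own TextLMDataBunch, and the hyperparameters are just the usual defaults I copied from the docs):

from fastai.text import *
import torch.nn as nn

# data_lm is a TextLMDataBunch built from my corpus (placeholder here)
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.model = nn.DataParallel(learn.model)
learn.fit_one_cycle(1, 1e-2)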

Thanks
Josh Kurz