Training language model with nn.DataParallel has unbalanced GPU memory usage

I’m using learn.model=nn.DataParallel(learn.model) as I’ve seen in the forums to try to scale up a model to train on multiple GPUs.


It seems to be working with multiple GPUs but training 1 epoch on 8x 2080ti is actually looking to be much slower than on 1x 2080ti.

I think this is because I haven’t been able to increase my batch size. If I try to increase the batch size I get a CUDA out of memory error because GPU 0 is disproportionately using more memory than the others. nvidia-smi output looks like this:

I’ve been doing some research and it looks like this is because nn.DataParallel accumulates the gradients onto a single GPU. There is some code in the last post there that purports to spread this out across all of the GPUs but I haven’t been able to get good results.
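One contributor that can be checked with simple arithmetic: by default nn.DataParallel gathers the outputs of every replica onto the first device, and for a language model that output is a logits tensor with one score per vocabulary entry. Below is a rough sketch of the sizes involved. The batch size, sequence length and vocabulary size are made-up values rather than the poster's settings, and only the output tensor is counted (no activations, gradients or optimizer state):

```python
def logits_bytes(batch_size, seq_len, vocab_size, bytes_per_elem=4):
    """Size in bytes of a (batch, seq_len, vocab) float32 logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_elem

# Hypothetical settings, chosen only for illustration.
n_gpus = 8
bs, bptt, vocab = 64, 70, 60_000

# Each replica produces logits for its own slice of the batch...
per_replica = logits_bytes(bs // n_gpus, bptt, vocab)
# ...and the gather step places the logits for the whole batch on one device.
gathered = logits_bytes(bs, bptt, vocab)

print(f"per replica: {per_replica / 2**20:.0f} MiB")
print(f"gathered on the output device: {gathered / 2**20:.0f} MiB")
```

With these numbers the output device holds eight times the logits of any single replica, which is consistent with the imbalance described above, though it is unlikely to be the whole story.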

If I do learn.loss_func=CriterionParallel(learn.loss_func) as that post suggests (where CriterionParallel is lifted from the forum post) it does balance out the memory usage slightly but not much (and the estimated time for 1 epoch nearly doubles compared to not using it):

I also found a link to this project which also tries to solve the problem. But when I try to use encoding.parallel.DataParallelModel and encoding.parallel.DataParallelCriterion like this:

import encoding
learn.model = encoding.parallel.DataParallelModel(learn.model)
learn.loss_func = encoding.parallel.DataParallelCriterion(learn.loss_func)

I get an error: AttributeError: 'FlattenedLoss' object has no attribute 'parameters'. I’m pretty sure this is because normal PyTorch loss functions are subclasses of nn.Module, whereas fastai’s FlattenedLoss doesn’t inherit from anything.


To get around this, I tried to make FlattenedLoss subclass nn.Module… but it was a dead end for me. It needed me to call super().__init__() before setting self.func but complained when I did that as well.
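An alternative to editing FlattenedLoss itself is to wrap the loss in a small nn.Module adapter. The sketch below uses a hypothetical LossAsModule class and a plain nn.CrossEntropyLoss as a stand-in for the fastai loss. It has not been tried with encoding.parallel.DataParallelCriterion, so whether it resolves the error above is an assumption:

```python
import torch
import torch.nn as nn

class LossAsModule(nn.Module):
    """Hypothetical adapter exposing any callable loss through the nn.Module interface."""

    def __init__(self, loss_func):
        # nn.Module refuses to register submodules before its own __init__ has run,
        # so super().__init__() has to come first.
        super().__init__()
        self.loss_func = loss_func

    def forward(self, output, target):
        return self.loss_func(output, target)

# Stand-in loss and random data, only to show the wrapper behaves like a module.
wrapped = LossAsModule(nn.CrossEntropyLoss())
loss = wrapped(torch.randn(4, 10), torch.tensor([1, 0, 3, 9]))

# The attribute the error complained about now exists (the list is simply empty).
print(list(wrapped.parameters()))
```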

Anyone have a solution for balancing out memory usage so I can increase batch size to take advantage of multiple GPUs while training a language model?

Edit: found this thread that summarizes the problem and at the end someone suggests using Distributed instead of Parallel which I may try later.

Using distributed looks like it works! I have even usage between GPUs and time per epoch looks ~8x faster than with a single GPU. The only downside is not being able to run it in a notebook but that’s a small price to pay!
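For anyone moving a notebook into a script: torch.distributed.launch starts one process per GPU and passes each one a --local_rank argument. Here is a minimal sketch of the script side, assuming the NCCL backend and the default env:// initialisation. parse_local_rank, setup_distributed and main are hypothetical helper names, and the fastai call mentioned in the comment is an assumption about the v1 API rather than something shown in this thread:

```python
import argparse

def parse_local_rank(argv=None):
    """Read the --local_rank flag that the launcher appends for each worker."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    # Ignore any other flags the training script may define elsewhere.
    args, _ = parser.parse_known_args(argv)
    return args.local_rank

def setup_distributed(local_rank):
    """Bind this process to one GPU and join the process group (needs CUDA and NCCL)."""
    # Imported here so the argument parsing above can be tried on a machine without a GPU.
    import torch
    import torch.distributed as dist
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

def main(argv=None):
    local_rank = parse_local_rank(argv)
    setup_distributed(local_rank)
    # Build the Learner as in the notebook, then wrap it before calling fit, e.g.
    # (assumed fastai v1 API): learn = learn.to_distributed(local_rank)
```

The script would end with a call to main() and be started with python -m torch.distributed.launch --nproc_per_node=8 followed by the script name.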

I may have spoken too soon. At the end of the first epoch it spat this out 8x:

root@C.288618:/workspace$ python -m torch.distributed.launch --nproc_per_node=8
epoch     train_loss  valid_loss  accuracy  time
Traceback (most recent call last):
  File "", line 332, in <module>
    learn.fit_one_cycle(1, 1e-2)
  File "/opt/conda/lib/python3.7/site-packages/fastai/", line 22, in fit_one_cycle
    , max_lr, wd=wd, callbacks=callbacks)
  File "/opt/conda/lib/python3.7/site-packages/fastai/", line 196, in fit
    fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
  File "/opt/conda/lib/python3.7/site-packages/fastai/", line 105, in fit
    cb_handler=cb_handler, pbar=pbar)
  File "/opt/conda/lib/python3.7/site-packages/fastai/", line 56, in validate
    for xb,yb in progress_bar(dl, parent=pbar, leave=(pbar is not None)):
  File "/opt/conda/lib/python3.7/site-packages/fastprogress/", line 66, in __iter__
    for i,o in enumerate(self._gen):
  File "/opt/conda/lib/python3.7/site-packages/fastai/", line 75, in __iter__
    for b in self.dl: yield self.proc_batch(b)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/", line 615, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/", line 615, in <listcomp>
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/opt/conda/lib/python3.7/site-packages/fastai/text/", line 77, in __getitem__
    [j],self.ri[j] = self.fill_row(not self.backwards, self.dataset.x.items, self.idx, self.batch[j],
  File "/opt/conda/lib/python3.7/site-packages/fastai/text/", line 40, in __getattr__
    def __getattr__(self,k:str)->Any: return getattr(self.dataset, k)
  File "/opt/conda/lib/python3.7/site-packages/fastai/", line 626, in __getattr__
    raise AttributeError(k)
AttributeError: batch

Going to dig in and see what I can figure out.

Edit: well that was easy, fixed on master already!

And in v1.0.51 now :wink: Yes, DataParallel isn’t fully supported for everything (Unets won’t work, for instance). Distributed training should be more reliable.

Is there an easy way to make it work for Unet?

No, it can’t work, since PyTorch’s hooks don’t seem to behave nicely in DataParallel. I asked for clues on how to fix it on the PyTorch Slack but never got an answer.

Hi Brad, I’m seeing the same issue with unbalanced GPU utilization using DataParallel. Just to clarify, was your solution to use PyTorch’s DistributedDataParallel, or something else? If the former, can you walk me through the steps you used to refactor your code from DataParallel to DistributedDataParallel?


This is what I ended up using:


I’m trying to train my language model with fit_one_cycle. When I run distributed training using the method from the link that @yeldarb gave, I get something like this right after starting my script:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

and after some frames

frame #42: __libc_start_main + 0xf0 (0x7f5f49d5d830 in /lib/x86_64-linux-gnu/
frame #43: _start + 0x29 (0x591eb9 in /data/chm/seg_mod/bin/python3)

I’m having an error like this:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:518)

What am I doing wrong? I have 2x 1080 Ti GPUs.

I am currently trying to solve the same issue - did you find a solution?

I found this explanation on StackOverflow but so far I could not derive a solution from it:

The solution provided here seems to work.
Add find_unused_parameters=True to line 32 in

self.learn.model = DistributedDataParallel(self.model, device_ids=[self.cuda_id], output_device=self.cuda_id, find_unused_parameters=True)
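The error message refers to parameters that receive no gradient during the backward pass. Setting find_unused_parameters=True makes DistributedDataParallel look for them on every iteration, at some extra cost. To see which parameters are involved, the sketch below lists them after a single backward pass in plain PyTorch, without DistributedDataParallel. TwoHeads is a hypothetical toy model, not the language model from this thread:

```python
import torch
import torch.nn as nn

class TwoHeads(nn.Module):
    """Toy model (hypothetical) whose second layer never contributes to the output."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)

    def forward(self, x):
        # Only self.used takes part in the computation graph.
        return self.used(x)

model = TwoHeads()
model(torch.randn(3, 4)).sum().backward()

# Parameters that never entered the graph still have grad set to None.
unused = [name for name, p in model.named_parameters() if p.grad is None]
print(unused)
```

Running the same check on one batch of the real model on a single GPU may show which layers trigger the error.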
