to_distributed() on language_model_learner() crashes during training

After following the tutorial “How to launch a distributed training”, I applied it to language_model_learner() and it failed. I think I have apex installed but not pytorch_nightly. Here is the error (even with one GPU):

Traceback (most recent call last):                                                                           
  File "distrib_lm_2019_04_29.py", line 149, in <module>
    learn.fit_one_cycle(10, slice(1e-2), moms=moms)
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/train.py", line 22, in fit_one_cycle
    learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/basic_train.py", line 199, in fit
    fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/basic_train.py", line 101, in fit
    loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/basic_train.py", line 34, in loss_batch
    if not skip_bwd:                     loss.backward()
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7f87991040a0>, [[tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0', dtype=torch.float16), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0', dtype=torch.float16), tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16), tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16), None, tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0', dtype=torch.float16), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0', dtype=torch.float16), tensor([-2232.0000,    13.0703,   -33.2812,  ...,    14.1641,    13.2656,
           13.6797], device='cuda:0', dtype=torch.float16)]], [0]

I may need to update versions of fastai and/or pytorch, but so far nothing on the install page has worked.

I’ve been doing distributed training and it’s worked OK so far. Have you tested to see if it might be something in your fp16 changes?

Edit: why did you use apex rather than Learner.to_fp16?

Using Apex would make the training loop crash, as it changes the optimizer (unless you went out of your way to make it compatible with a Callback).
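
For reference, the fastai-native route keeps mixed precision inside the callback system; a minimal sketch (assuming a data_lm databunch and the args.local_rank argparse value from the distributed-training tutorial):

from fastai.text import *          # language_model_learner, AWD_LSTM
from fastai.distributed import *   # Learner.to_distributed

# to_fp16() enables fastai's mixed-precision callback, so it cooperates with
# the training loop instead of replacing the optimizer the way apex does.
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False)
learn = learn.to_fp16().to_distributed(args.local_rank)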

Hi Brad and Sylvain, I just commented out the line referring to apex:

# import apex.fp16_utils as fp16

It still crashes with the same message. Brad, I am using to_fp16:

learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False).to_fp16().to_distributed(args.local_rank)

I read from Sylvain’s earlier post that I need fastai v1.0.51. I installed that with:

pip install git+https://github.com/fastai/fastai.git

But when I do “conda list fastai” it shows “1.0.50.post1”. How do I know what version my script is using?
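
One quick way to check from inside the script itself is to print the version attributes of the modules the interpreter actually imports (a minimal sketch, assuming the standard __version__ attributes):

import sys
import fastai
import torch

# Show which interpreter is running and which library builds it picked up.
print(sys.executable)
print('fastai:', fastai.__version__)
print('torch: ', torch.__version__)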

See if it still happens if you remove your fp16 code. And try a version without your model tweaks. Also try with the vision example from the docs.

If one of those works, it’ll at least narrow down your issue. If not, then messing with fastai/pytorch versions may be the next step.
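
For example, the fp16-free test might look like this (a sketch only, reusing the data_lm bunch and args.local_rank from the original script):

# Same learner as before, but without to_fp16() and without any model tweaks,
# to check whether mixed precision is what breaks the distributed backward pass.
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False)
learn = learn.to_distributed(args.local_rank)
learn.fit_one_cycle(1, 1e-2)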

I think I had things “working” on 1.0.50.post1 (just not with great accuracy, because the DistributedSampler shuffles when it shouldn’t).

Thank you, Brad! Good strategy. I will start trying these things after my current language model training finishes tomorrow. If distributed looks good, I can replace my other three 1080s with RTX 2080 Tis. It’s kind of pricey at $1,400 each, but still cheaper than buying a car. Congrats on getting a box with 8x RTX 2080 Tis. With fastai, you should be able to match anything that Google can come up with :slight_smile:

I’ve had the same error message. I used the launch.py script from https://github.com/fastai/fastai/blob/master/fastai/launch.py and, for the training script, I modified the one from https://github.com/fastai/fastai/blob/e6409953196cfa9ab8297c2603612bdbbc18f565/examples/train_cifar.py like this:

from fastai.script import *
from fastai.text import *
from fastai.distributed import *
torch.backends.cudnn.benchmark = True

common_path = '/home/renard/'
lm_file_name = 'language_model_data.csv'
short_lm_file_name = 'short_language_model_data.csv'
binary_data_file_name = '02042019_binary_cleaned_bullying.csv'

batch_size = 210

path='/home/renard'

@call_parse
def main( gpu:Param("GPU to run on", str)=None ):
    """Distrubuted training of CIFAR-10.
    Fastest speed is if you run as follows:
        python -m fastai.launch train_cifar.py"""

    data_lm = load_data('/home/renard', 'data_lm.pkl', bs=batch_size)

    gpu = setup_distrib(gpu)
    n_gpus = num_distrib()
    workers = min(16, num_cpus()//n_gpus)
    data_lm = load_data('/home/renard', 'data_lm.pkl', bs=batch_size//n_gpus)

    learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
    if gpu is None: learn.model = nn.DataParallel(learn.model)
    else: learn.to_distributed(gpu)
    learn.fit_one_cycle(1, 4e-2, moms=(0.8,0.7))
    learn.unfreeze()
    learn.fit_one_cycle(6, 4e-4, moms=(0.8,0.7))
    learn.save('dist_lm_{}'.format(gpu))

I changed the architecture to Transformer because I found a comment saying RNN models don’t support parallel/distributed training, but I got the same error message.
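
The only change needed for that experiment is the arch argument (a sketch, keeping the same hyperparameters as the script above):

# Swap the architecture; everything else in the script stays the same.
learn = language_model_learner(data_lm, Transformer, drop_mult=0.3, pretrained=False)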

Earlier I used DataParallel on AWD_LSTM like this:

learner.model = nn.DataParallel(learner.model)

The model loaded on all GPUs, but only one of them was actually processing samples (its memory use increased from 2k to 15k). The other GPUs had a non-zero GPU-Util reading, but the estimated training time was longer than with a single GPU (single GPU: 6h per epoch; 4 GPUs: 8h). At first I thought the batch size was too small, but when I increased it I got a GPU OOM error. I’m adding a picture of the console that I took during training of the DataParallel model (in the distributed case I was never able to start training at all).

After these unsuccessful attempts to parallelize the language model, I wanted to make sure I wasn’t making some kind of stupid mistake, so I checked whether parallelization works for me on vision CNNs. It does.
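
For reference, that kind of vision sanity check is just the same pattern applied to a CNN learner (a sketch only, assuming an existing ImageDataBunch called data and the gpu value returned by setup_distrib):

from fastai.vision import *        # cnn_learner, models, accuracy
from fastai.distributed import *

# Minimal distributed CNN test, mirroring the vision example in the docs.
learn = cnn_learner(data, models.resnet50, metrics=accuracy).to_distributed(gpu)
learn.fit_one_cycle(1, 1e-2)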

Any updates on this thread? I am also facing the same issue with language_model_learner:

       [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:1', dtype=torch.float16), tensor([[-0.0034,  0.0070, -0.0011,  ...,  0.0082, -0.0056, -0.0018],
        [ 0.0298, -0.0601,  0.0119,  ..., -0.0657,  0.0471,  0.0157],
        [ 0.0252, -0.0493,  0.0099,  ..., -0.0545,  0.0420,  0.0162],
        ...,
        [ 0.0160, -0.0316,  0.0052,  ..., -0.0331,  0.0238,  0.0080],
        [-0.0216,  0.0455, -0.0077,  ...,  0.0490, -0.0348, -0.0115],
        [ 0.0177, -0.0342,  0.0061,  ..., -0.0376,  0.0265,  0.0085]],
       device='cuda:1', dtype=torch.float16), None, tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:1', dtype=torch.float16), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:1', dtype=torch.float16), tensor([  2.2324,   2.1836, -74.5625,  ...,   2.1406,   2.1582,   2.1660],
       device='cuda:1', dtype=torch.float16)]], [1]
Traceback (most recent call last):
  File "lm_lang_train_script.py", line 17, in <module>
    def main(gpu:Param("GPU to run on", str)=None):
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/fastai/script.py", line 40, in call_parse
    func(**args.__dict__)
  File "lm_lang_train_script.py", line 98, in main
    learn_lang_lm.fit_one_cycle(10, lr, moms=(0.8,0.7))
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/fastai/train.py", line 22, in fit_one_cycle
    learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py", line 200, in fit
    fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py", line 101, in fit
    loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py", line 34, in loss_batch
    if not skip_bwd:                     loss.backward()
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7f0cf4d49030>, [[None, tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:2', dtype=torch.float16), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:2', dtype=torch.float16), tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:2', dtype=torch.float16), tensor([[-0.0047,  0.0104, -0.0009,  ...,  0.0120, -0.0073, -0.0027],
        [ 0.0257, -0.0571,  0.0064,  ..., -0.0629,  0.0356,  0.0140],
        [ 0.0447, -0.0952,  0.0126,  ..., -0.1051,  0.0648,  0.0222],
        ...,
        [ 0.0157, -0.0322,  0.0034,  ..., -0.0357,  0.0211,  0.0078],
        [-0.0143,  0.0321, -0.0040,  ...,  0.0361, -0.0212, -0.0075],
        [ 0.0212, -0.0449,  0.0055,  ..., -0.0494,  0.0303,  0.0109]],
       device='cuda:2', dtype=torch.float16), None, tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:2', dtype=torch.float16), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:2', dtype=torch.float16), tensor([  2.2305,   2.1797, -48.9688,  ...,   2.1426,   2.1445,   2.1582],
       device='cuda:2', dtype=torch.float16)]], [2]

As I recall, I had a breakthrough when @stas gave me the command line to install the “nightly” version of PyTorch.

conda install -c pytorch pytorch-nightly

He may have done some other things too, but he got an LM to train with 2 GPUs using to_distributed. I remember it went twice as fast as on 1 GPU, but the per-epoch improvement in validation loss wasn’t as great as with one GPU. Still, it was a big win!
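
After that install, it is worth confirming which build the environment actually picked up (a minimal check; nightly builds usually carry a .dev date suffix in the version string):

import torch

# Confirm the nightly build and its CUDA support are the ones the script sees.
print(torch.__version__)          # e.g. a 1.x.0.dev... string for nightlies
print(torch.version.cuda)         # CUDA toolkit the wheel was built against
print(torch.cuda.is_available())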
