to_distributed() on language_model_learner() crashes during training

After following the tutorial “How to launch a distributed training”, I applied it to language_model_learner() and it failed. I think I have apex installed but not pytorch_nightly. Here is the error (even with one GPU):

Traceback (most recent call last):                                                                           
  File "distrib_lm_2019_04_29.py", line 149, in <module>
    learn.fit_one_cycle(10, slice(1e-2), moms=moms)
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/train.py", line 22, in fit_one_cycle
    learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/basic_train.py", line 199, in fit
    fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/basic_train.py", line 101, in fit
    loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/basic_train.py", line 34, in loss_batch
    if not skip_bwd:                     loss.backward()
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/home/dludwig1/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7f87991040a0>, [[tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0', dtype=torch.float16), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0', dtype=torch.float16), tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16), tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16), None, tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0', dtype=torch.float16), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0', dtype=torch.float16), tensor([-2232.0000,    13.0703,   -33.2812,  ...,    14.1641,    13.2656,
           13.6797], device='cuda:0', dtype=torch.float16)]], [0]

I may need to update versions of fastai and/or pytorch, but so far nothing on the install page has worked.

I’ve been doing distributed training and it’s worked OK so far. Have you tested to see if it might be something in your fp16 changes?

Edit: why did you use apex rather than Learner.to_fp16?

Using Apex would make the training loop crash, as it changes the optimizer (unless you went out of your way to make it compatible with a Callback).
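
For reference, the fastai-native route keeps mixed precision inside the callback system; a minimal sketch (assuming a data_lm databunch and the args.local_rank argparse value from the distributed-training tutorial):

from fastai.text import *          # language_model_learner, AWD_LSTM
from fastai.distributed import *   # Learner.to_distributed

# to_fp16() enables fastai's mixed-precision callback, so it cooperates with
# the training loop instead of replacing the optimizer the way apex does.
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False)
learn = learn.to_fp16().to_distributed(args.local_rank)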

Hi Brad and Sylvain, I just commented out the line referring to apex:

# import apex.fp16_utils as fp16

It still crashes with the same message. Brad, I am using to_fp16:

learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False).to_fp16().to_distributed(args.local_rank)

I read from Sylvain’s earlier post that I need fastai v1.0.51. I installed that with:

pip install git+https://github.com/fastai/fastai.git

But when I do “conda list fastai” it shows “1.0.50.post1”. How do I know what version my script is using?
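
One quick way to check from inside the script itself is to print the version attributes of the modules the interpreter actually imports (a minimal sketch, assuming the standard __version__ attributes):

import sys
import fastai
import torch

# Show which interpreter is running and which library builds it picked up.
print(sys.executable)
print('fastai:', fastai.__version__)
print('torch: ', torch.__version__)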

See if it still happens if you remove your fp16 code. And try a version without your model tweaks. Also try with the vision example from the docs.

If one of those works, it’ll at least narrow down your issue. If not, then messing with fastai/pytorch versions may be the next step.
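
For example, the fp16-free test might look like this (a sketch only, reusing the data_lm bunch and args.local_rank from the original script):

# Same learner as before, but without to_fp16() and without any model tweaks,
# to check whether mixed precision is what breaks the distributed backward pass.
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False)
learn = learn.to_distributed(args.local_rank)
learn.fit_one_cycle(1, 1e-2)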

I think I had things “working” on 1.0.50.post1 (just not with great accuracy, because the DistributedSampler shuffles when it shouldn’t).

Thank you, Brad! Good strategy. I will start trying these things after my current language model training finishes tomorrow. If distributed looks good, I can replace my other three 1080s with RTX 2080 Tis. It’s kind of pricey at $1,400 each, but still cheaper than buying a car. Congrats on getting a box with 8x RTX 2080 Tis. With fastai, you should be able to match anything that Google can come up with :slight_smile:

I’ve had the same error message. I used the launch.py script from https://github.com/fastai/fastai/blob/master/fastai/launch.py and, for the training script, I modified the one from https://github.com/fastai/fastai/blob/e6409953196cfa9ab8297c2603612bdbbc18f565/examples/train_cifar.py like this:

from fastai.script import *
from fastai.text import *
from fastai.distributed import *
torch.backends.cudnn.benchmark = True

common_path = '/home/renard/'
lm_file_name = 'language_model_data.csv'
short_lm_file_name = 'short_language_model_data.csv'
binary_data_file_name = '02042019_binary_cleaned_bullying.csv'

batch_size = 210

path='/home/renard'

@call_parse
def main( gpu:Param("GPU to run on", str)=None ):
    """Distrubuted training of CIFAR-10.
    Fastest speed is if you run as follows:
        python -m fastai.launch train_cifar.py"""

    data_lm = load_data('/home/renard', 'data_lm.pkl', bs=batch_size)

    gpu = setup_distrib(gpu)
    n_gpus = num_distrib()
    workers = min(16, num_cpus()//n_gpus)
    data_lm = load_data('/home/renard', 'data_lm.pkl', bs=batch_size//n_gpus)

    learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
    if gpu is None: learn.model = nn.DataParallel(learn.model)
    else: learn.to_distributed(gpu)
    learn.fit_one_cycle(1, 4e-2, moms=(0.8,0.7))
    learn.unfreeze()
    learn.fit_one_cycle(6, 4e-4, moms=(0.8,0.7))
    learn.save('dist_lm_{}'.format(gpu))

I changed the architecture to Transformer because I found a comment saying RNN models don’t support parallel/distributed training, but I got the same error message.
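
The only change needed for that experiment is the arch argument (a sketch, keeping the same hyperparameters as the script above):

# Swap the architecture; everything else in the script stays the same.
learn = language_model_learner(data_lm, Transformer, drop_mult=0.3, pretrained=False)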

Earlier I used DataParallel on AWD_LSTM like this:

learner.model = nn.DataParallel(learner.model)

The model loaded on all GPUs, but only one of them was actually processing samples (its memory use increased from 2k to 15k). The other GPUs had a non-zero GPU-Util reading, but the estimated training time was longer than with a single GPU (single GPU: 6h per epoch; 4 GPUs: 8h). At first I thought the batch size was too small, but when I increased it I got a GPU OOM error. I’m adding a picture of the console that I took during training of the DataParallel model (in the distributed case I was never able to start training at all).

After these unsuccessful attempts to parallelize the language model, I wanted to make sure I wasn’t making some kind of stupid mistake, so I checked whether parallelization works for me on vision CNNs. It does.
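
For reference, that kind of vision sanity check is just the same pattern applied to a CNN learner (a sketch only, assuming an existing ImageDataBunch called data and the gpu value returned by setup_distrib):

from fastai.vision import *        # cnn_learner, models, accuracy
from fastai.distributed import *

# Minimal distributed CNN test, mirroring the vision example in the docs.
learn = cnn_learner(data, models.resnet50, metrics=accuracy).to_distributed(gpu)
learn.fit_one_cycle(1, 1e-2)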

Any updates on this thread? I am also facing the same issue with language_model_learner:

       [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:1', dtype=torch.float16), tensor([[-0.0034,  0.0070, -0.0011,  ...,  0.0082, -0.0056, -0.0018],
        [ 0.0298, -0.0601,  0.0119,  ..., -0.0657,  0.0471,  0.0157],
        [ 0.0252, -0.0493,  0.0099,  ..., -0.0545,  0.0420,  0.0162],
        ...,
        [ 0.0160, -0.0316,  0.0052,  ..., -0.0331,  0.0238,  0.0080],
        [-0.0216,  0.0455, -0.0077,  ...,  0.0490, -0.0348, -0.0115],
        [ 0.0177, -0.0342,  0.0061,  ..., -0.0376,  0.0265,  0.0085]],
       device='cuda:1', dtype=torch.float16), None, tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:1', dtype=torch.float16), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:1', dtype=torch.float16), tensor([  2.2324,   2.1836, -74.5625,  ...,   2.1406,   2.1582,   2.1660],
       device='cuda:1', dtype=torch.float16)]], [1]
Traceback (most recent call last):
  File "lm_lang_train_script.py", line 17, in <module>
    def main(gpu:Param("GPU to run on", str)=None):
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/fastai/script.py", line 40, in call_parse
    func(**args.__dict__)
  File "lm_lang_train_script.py", line 98, in main
    learn_lang_lm.fit_one_cycle(10, lr, moms=(0.8,0.7))
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/fastai/train.py", line 22, in fit_one_cycle
    learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py", line 200, in fit
    fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py", line 101, in fit
    loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py", line 34, in loss_batch
    if not skip_bwd:                     loss.backward()
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/usr/local/share/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7f0cf4d49030>, [[None, tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:2', dtype=torch.float16), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:2', dtype=torch.float16), tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:2', dtype=torch.float16), tensor([[-0.0047,  0.0104, -0.0009,  ...,  0.0120, -0.0073, -0.0027],
        [ 0.0257, -0.0571,  0.0064,  ..., -0.0629,  0.0356,  0.0140],
        [ 0.0447, -0.0952,  0.0126,  ..., -0.1051,  0.0648,  0.0222],
        ...,
        [ 0.0157, -0.0322,  0.0034,  ..., -0.0357,  0.0211,  0.0078],
        [-0.0143,  0.0321, -0.0040,  ...,  0.0361, -0.0212, -0.0075],
        [ 0.0212, -0.0449,  0.0055,  ..., -0.0494,  0.0303,  0.0109]],
       device='cuda:2', dtype=torch.float16), None, tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:2', dtype=torch.float16), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:2', dtype=torch.float16), tensor([  2.2305,   2.1797, -48.9688,  ...,   2.1426,   2.1445,   2.1582],
       device='cuda:2', dtype=torch.float16)]], [2]

As I recall, I had a breakthrough when @stas gave me the command line to install the “nightly” version of PyTorch.

conda install -c pytorch pytorch-nightly

He may have done some other things too, but he got an LM to train with 2 GPUs using to_distributed. I remember it went twice as fast as on 1 GPU, but the per-epoch improvement in validation loss wasn’t as great as with one GPU. Still, it was a big win!
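
After that install, it is worth confirming which build the environment actually picked up (a minimal check; nightly builds usually carry a .dev date suffix in the version string):

import torch

# Confirm the nightly build and its CUDA support are the ones the script sees.
print(torch.__version__)          # e.g. a 1.x.0.dev... string for nightlies
print(torch.version.cuda)         # CUDA toolkit the wheel was built against
print(torch.cuda.is_available())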
