Distributed and parallel training... explained

pierreguillou · July 1, 2020, 9:57pm

Thanks for your message @wgpubs but the problem is independent of the Transformers version. In fact, it does not come from Transformers v3: but the warning, it was the same problem with 2.11.0 (I updated today from 2.11.0 to 3.0.0).

And the Transformers tutorial of Sylvain works perfectly well with Transformers v3 on one GPU (at least on my server).

I think the problem is mainly here :

File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [4382] at entry 0 and [4065] at entry 1

My understanding is that the training and validation datasets are distributed to the 2 processes (one by GPU), not the batches (Sequence Length of one batch in the Dataloaders = 1024). Then, the batches are created on each GPU but without taking care of the Sequence Length of 1024. As the datasets are a concatenation of texts with different length, torch.stack() can not process them.

The question is why the Dataloaders is not used at the process level when the mode is DDP in fastai v2?